Abstract: Recovery of the sparsity pattern (or support) of a 
sparse vector from a small number of noisy linear projections (or 
samples) is a common problem that arises in signal processing 
and statistics. In this paper, the high-dimensional setting is 
considered. It is shown that if the sampling rate and per-sample 
signal-to-noise ratio (SNR) are finite constants independent of the 
length of the vector, then the optimal sparsity pattern estimate 
will have a constant fraction of errors. Lower bounds on the 
sampling rate needed to attain a desired fraction of errors are 
given in terms of the SNR and various key parameters of the 
unknown vector. The tightness of the bounds in a scaling sense, as 
a function of the SNR and the fraction of errors, is established by 
comparison with existing achievable bounds. Near optimality is 
shown for a wide variety of practically motivated signal models. 

Index Terms — compressed sensing, information-theoretic 
bounds, random matrices, random projections, regression, 
sparse approximation, sparsity, subset selection. 



I. Introduction 

Recovery of sparse or compressible signals from a lim- 
ited number of noisy linear projections is a problem that 
has received considerable attention in signal processing and 
statistics. Suppose, for instance, that a vector x of length n 
is known to have exactly k nonzero elements, but the values 
and locations of these elements are unknown and must be 
estimated from a set of m noisy linear projections (or samples)
of the form

\[ Y_i = \langle \phi_i, x \rangle + W_i \quad \text{for } i = 1, \dots, m \tag{1} \]

where $\phi_i$ are known sampling vectors, $\langle \cdot, \cdot \rangle$ denotes the usual
Euclidean inner product, and $W_i$ is additive white Gaussian
noise. Then, a key insight from sparse signal recovery is
that the number of samples required for reliable estimation
depends primarily on the number of nonzero elements, and is
potentially much less than the length of the vector.
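As a concrete illustration of the observation model in (1), the following minimal sketch draws a k-sparse vector and its noisy samples using only the Python standard library. The function names, the Gaussian choice for the nonzero values, and the i.i.d. N(0, 1/n) sampling vectors are assumptions made for illustration; the model itself only requires known sampling vectors and white Gaussian noise.

```python
import random


def sample_sparse_vector(n, k, rng):
    """Return (x, support): a length-n list with exactly k nonzero entries."""
    support = set(rng.sample(range(n), k))
    # Nonzero values are drawn from N(0, 1); this is an assumption of the sketch.
    x = [rng.gauss(0.0, 1.0) if i in support else 0.0 for i in range(n)]
    return x, support


def noisy_projections(x, m, rng):
    """Return (A, y) with y[i] = <phi_i, x> + w_i, where w_i ~ N(0, 1)."""
    n = len(x)
    # Hypothetical choice of sampling vectors: i.i.d. N(0, 1/n) entries.
    A = [[rng.gauss(0.0, n ** -0.5) for _ in range(n)] for _ in range(m)]
    y = [sum(a * xi for a, xi in zip(row, x)) + rng.gauss(0.0, 1.0) for row in A]
    return A, y


rng = random.Random(0)
x, support = sample_sparse_vector(n=50, k=5, rng=rng)
A, y = noisy_projections(x, m=20, rng=rng)
```

Here m = 20 is smaller than n = 50, which is the regime of interest: the estimator sees only (y, A, k) and must infer `support`.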

One estimation problem of particular interest is to determine
which elements of the vector x are nonzero. This problem,
which is referred to as sparsity pattern recovery in this paper, is
known variously throughout the literature as support recovery
or model selection and has applications in compressed sensing
[1]-[4], sparse approximation [5], signal denoising [6], subset
selection in regression [7], and structure estimation in
graphical models [8].

A large body of work [8]-[23] has considered exact recovery
of the sparsity pattern by deriving necessary and sufficient
conditions on scalings of the tuple $(n, k_n, m_n)$ to ensure
that the probability of exact recovery tends to one as the
vector length n becomes large. It is now well understood
that there exist two fundamentally different scalings depending
on whether or not the samples are corrupted by noise: the
noiseless setting requires $m_n = k_n + 1$ and the noisy
setting requires $m_n = k_n + 1 + C \cdot k_n \log n$ where C is a
constant (whose exact value is typically unknown but bounded)
that depends on the signal-to-noise ratio (SNR) and various
other assumptions about the values of the nonzero elements.
Although these scaling results provide valuable insights for
a variety of problem settings and serve as a benchmark for the
development of computationally efficient algorithms, they have
two important limitations.

The first limitation is that in many practically relevant
settings, the cost (in terms of number of samples and SNR)
of exact sparsity pattern recovery far exceeds the cost of other
estimation tasks. For example, suppose that $k_n/n \to \Omega$ for
some positive sparsity rate $\Omega$ and $m_n/n \to \rho$ for some
sampling rate $\rho$. Then, a central result from compressed
sensing [24], [25] is that the vector x can be estimated with
bounded mean squared error (MSE) even if the sampling rate $\rho$
is finite (and possibly much less than one) and the per-sample
SNR is a fixed value that does not depend on n. By contrast,
the scaling results outlined above show that exact recovery of
the sparsity pattern is not possible unless either the sampling
rate $\rho$ is infinite or the per-sample SNR increases without
bound with n. If noise is due to quantization, this means that
accurate estimation with respect to MSE requires only a fixed
bit-rate whereas exact recovery of the sparsity pattern requires
an unbounded bit-rate.

The second limitation is that scaling results in terms of the
dimensions $(n, k_n, m_n)$ do not tell the whole story. Often, one
needs to know the exact constants involved in the bounds and 
the dependence of these constants on parameters such as the 
SNR or various assumptions about the vector x. For many 
of the estimation tasks considered throughout the compressed 
sensing literature, these properties are not well understood. 
As a result, the majority of sufficient conditions are far more 
conservative than those suggested by empirical evidence, and 
the optimality (or gap from optimality) of existing algorithms 
is difficult to determine due to the potential looseness of the 
necessary conditions. 

In the present work, we derive bounds for approximate
recovery of the sparsity pattern. In particular, we derive lower
bounds on the number of samples required to ensure that the
sparsity pattern can be estimated with no more than $\alpha \cdot k_n$
errors for some error rate $\alpha$. Corresponding upper bounds are
derived in the companion paper [26]. An example of these
bounds is shown in Figure 1.

Fig. 1. Bounds on the asymptotic sampling rate $\rho = m/n$ and SNR required
to identify the locations of at least 90% of the nonzero elements of a vector
$x \in \mathbb{R}^n$ with sparsity rate $\Omega = k/n = 10^{-4}$ when the power of the smallest
nonzero element is at least 20% of the average power of the nonzero elements.
The upper bounds correspond to the Nearest Subspace (NS) and Thresholding
(TH) estimators analyzed in [26].

The contributions of this paper directly address the lim- 
itations of the scaling results for exact recovery outlined 
above. With respect to the first limitation, we show that if the
sampling rate and SNR are both finite, then the error rate $\alpha$
must be a positive constant independent of n. In other words,
recovery up to an asymptotically vanishing fraction of errors is 
as hard as perfect recovery, a gap left open by previous work. 
This means that in order to have the same scaling behavior 
as in the MSE recovery problem, one must consider sparsity 
pattern recovery subject to a constant fraction of errors. 

With respect to the second limitation, our lower bounds are 
derived with an explicit dependence on various key problem 
parameters such as the SNR, the sparsity rate, and the relative 
size of the smallest nonzero elements. These bounds allow 
us to consider a wide variety of problem settings where the 
unknown vectors may be deterministic or stochastic and the 
magnitude of the smallest nonzero element may tend to zero as 
the vector length becomes large. Our framework allows us to 
address a number of important questions: 

• What is the effect of prior information? The upper bounds 
in [26] correspond to estimators that know the exact 
number of nonzero elements, but have no prior informa- 
tion about their values. The lower bounds in this paper 
apply to settings where the estimator may know statistical 
information such as the average power, range of values, 
or distribution. Interestingly, the resulting bounds show 
that in many cases, this additional knowledge does not 
significantly improve the ability to estimate the sparsity 
pattern. 

• What happens as the desired error rate tends to zero?
Our bounds show that the sampling rate depends on
the inverse of the error rate $1/\alpha$. If the magnitudes of
the nonzero elements have a fixed lower bound that is
independent of n then this dependence is logarithmic.
Otherwise, the dependence is polynomial.

• How does recovery depend on the SNR? We show that
the sampling rate $\rho$ must scale like $1/\log(1 + \mathrm{SNR})$. In
particular, our bounds show that performance is dominated
by the size of the smallest nonzero elements at low
SNR and by the entropy of the nonzero elements at high
SNR.

• What happens in the noiseless setting? It is straightforward
to see that $\rho > \Omega$ is a lower bound whenever the
nonzero values are unconstrained. By contrast, when the
nonzero values are drawn from a discrete and finite set
(known to the estimator), then $\rho > 0$ is the best universal
lower bound. We show that if the nonzero values are
drawn from a known distribution with sufficiently large
(differential) entropy, then the condition $\rho > \Omega$ remains
necessary.

This paper is organized as follows: Section II gives the
precise problem formulation. Section III gives information-theoretic
lower bounds on the sampling rate for the noiseless
setting. Section IV gives corresponding bounds for the noisy
setting. Section V compares the scaling properties of these
lower bounds with those of the upper bounds in [26]. Section
VI provides specific examples and illustrations, and proofs
are given in the Appendices. The following section provides
a brief, and necessarily incomplete, overview of work related 
to this paper. 

A. Related Work 

One line of related research has focused on the design 
and analysis of computationally efficient algorithms for sparse 
signal approximation Q-g), 0, ED, |l24l-ll36l. A key 
theoretical result [24], [25] is that any k-sparse vector x of
length n can be approximated with bounded mean squared
error $\|x - \hat{x}\|^2 / n < C_1/\mathrm{SNR}$ using $m = \lceil C_2 \cdot k \log(n/k) \rceil$
samples and a quadratic program known as Basis Pursuit
[6] where $C_1, C_2$ are finite constants. In the absence of any
sampling noise, this result guarantees exact recovery. In the
presence of noise, however, bounds on the mean squared error
are insufficient to determine the accuracy of the estimated
sparsity pattern. Work focusing directly on recovery of the sparsity
pattern [8], [12], [13] has derived various sufficient conditions
for exact recovery using a particular convex relaxation known
in the statistics literature as the Lasso [30].

In conjunction with the results outlined above, another 
line of research has focused on the fundamental limitations 
of sparse signal approximation that apply to any algorithm, 
regardless of computational complexity. For the special case 
of exact recovery in the noiseless setting, these limitations 
have been well understood: recovery of any k-sparse vector
requires exactly m = 2k samples for deterministic guarantees
and only m = k + 1 samples for almost sure guarantees [9]-[11],
regardless of the vector length n. In both cases, recovery
corresponds to an NP-hard [37] exhaustive search through all
possible sparsity patterns. In Section III of this paper, we
address the extent to which an even smaller number of samples
is needed when there exists prior knowledge about the vector
x, or when only partial recovery is needed.

Although the noiseless setting provides insight into the 
limitations of sparse approximation that cannot be overcome 
simply by increasing the SNR, consideration of the noisy 
setting is crucial for cases where noise is intrinsic to the 
problem or where real-valued numbers are subject to rate con- 
straints. From an information-theoretic perspective, a number 
of works have studied the rate-distortion behavior of sparse 
sources [38]-[43]. Most closely related to this paper, however,
is work that has addressed sparsity pattern recovery directly.
An initial necessary bound based on Fano's inequality was
provided by Gastpar [11], who considered Gaussian signals
and deterministic sampling vectors. Necessary and sufficient
scalings of (n, k, m) were given by Wainwright [14], who
considered deterministic vectors, characterized by the size
of their smallest nonzero elements, and Gaussian sampling
vectors. Wainwright's necessary bound was strengthened by
Reeves [15] for the special case where k scales proportionally
with n, and for general scalings by Fletcher et al. [20] and
Wang et al. [21].

A number of papers have also addressed extensions to approximate
recovery: necessary and sufficient conditions were
provided by Aeron et al. [16], [17] for the special case of
discrete vectors, and by Akcakaya and Tarokh [18] and Reeves
[15] for general vectors. The results given in Section IV of
this paper significantly strengthen the previous necessary conditions
in the ways described in the introduction. Comparable
improvements to the previous sufficient conditions are given
in the companion paper [26].

II. Problem Formulation 

In this paper, we assume that x is an arbitrary (non-random)
element from some subset $\mathcal{X}_n \subset \mathbb{R}^n$. The sparsity pattern
$s \subset \{1, 2, \dots, n\}$ is the set of integers indexing the nonzero
elements of x,

\[ s := \{ i : x_i \neq 0 \}, \]

and the sparsity $k = |s|$ is the number of nonzero elements.

We assume that x is sampled using the noisy linear observation
model given in (1). In matrix form, the vector of samples
$Y = [Y_1, \dots, Y_m]^T$ can be expressed as

\[ Y = Ax + W \]

where the sampling matrix $A \in \mathbb{R}^{m \times n}$ has rows $\phi_i^T$ and
the noise vector $W = [W_1, \dots, W_m]^T$ has i.i.d. standard
Gaussian elements. We further assume that an estimator is
given the set (Y, A, k), and the goal is to recover the sparsity
pattern s of x. In some cases, additional information about the
set $\mathcal{X}_n$ is also provided.

To quantify the distortion between a sparsity pattern s
and its estimate $\hat{s}$, it is important to observe that there are
two different error events: one type of error occurs when an
element in s is omitted from the estimate $\hat{s}$ and the other
occurs when an element not present in s is included in $\hat{s}$. In
this paper, we focus on recovery at the point where there is an
equal number of each error type. We assume throughout that
any estimate $\hat{s}$ has the same size k as the true sparsity pattern
s, and we define the distortion in terms of the relative overlap,

\[ d(s, \hat{s}) := 1 - \frac{|s \cap \hat{s}|}{k}. \]

We say that recovery is successful with respect to distortion
$\alpha \in [0, 1]$ if $d(s, \hat{s}) \le \alpha$. Exact recovery corresponds to the
case $\alpha = 0$.
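The distortion measure can be sketched in a few lines, assuming it is one minus the fraction of the true sparsity pattern that the estimate recovers. The function name `distortion` is a hypothetical helper, not notation from the paper.

```python
def distortion(s, s_hat):
    """Return 1 - |s ∩ s_hat| / k for two index sets of equal size k."""
    s, s_hat = set(s), set(s_hat)
    if not s or len(s) != len(s_hat):
        raise ValueError("both sparsity patterns must have the same nonzero size")
    return 1.0 - len(s & s_hat) / len(s)


# Two of the four true indices are found, so half of the pattern is missed.
half_missed = distortion({1, 2, 3, 4}, {1, 2, 5, 6})
```

Because the estimate has the same size as the true pattern, every omitted index is matched by one wrongly included index, which is the "equal number of each error type" operating point described in the text.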

We are interested in performance guarantees that hold
uniformly for any $x \in \mathcal{X}_n$. It is important to note, however,
that for any particular sampling matrix A, there may exist a
degenerate subset of $\mathcal{X}_n$ for which recovery is particularly
difficult. To overcome the effects of these sets, we allow A
to be a random matrix (denoted using boldface) distributed
independently of x. Given any sparsity pattern estimator
$\hat{s}(Y, A, k)$, the probability of error corresponds to the worst
case $x \in \mathcal{X}_n$ with respect to the distribution on A,

\[ P_e^{(n)} = \max_{x \in \mathcal{X}_n} \Pr\{ d(s, \hat{s}(Y, A, k)) > \alpha \}. \]

Estimation in the presence of noise depends critically on 
the size of the entries in the sampling matrix. In this paper, 
we assume that 

\[ E[\mathrm{tr}(A A^T)] = m. \tag{2} \]

This scaling is consistent with the related work O, |4|, lfl5l . 
IfTTl . |[T9l and corresponds to the setting where each sampling 
vector (i.e. row of A) has unit magnitude. Thus, one useful
property of this scaling is that the SNR of the linear samples
given in (1) can be compared directly to that of classical samples
of the form $Y_i = X_i + W_i$. Another useful property is that the
SNR does not depend on the number of samples m.
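The normalization in (2) can be checked numerically. The sketch below assumes one particular matrix ensemble that satisfies it, namely i.i.d. N(0, 1/n) entries, so that each row has unit expected squared magnitude; the helper name and the parameter values are illustrative only.

```python
import random


def trace_gram(m, n, rng):
    """Return tr(A A^T) for one m x n matrix with i.i.d. N(0, 1/n) entries.

    tr(A A^T) equals the sum of all squared entries, so A is never stored.
    """
    std = n ** -0.5
    return sum(rng.gauss(0.0, std) ** 2 for _ in range(m * n))


rng = random.Random(1)
m, n, trials = 20, 100, 50
# Monte Carlo estimate of E[tr(A A^T)], which should be close to m.
average_trace = sum(trace_gram(m, n, rng) for _ in range(trials)) / trials
```

Under the alternative convention where each entry has unit power, the same quantity would be close to m * n instead, which is why results across papers need rescaling before comparison.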

We caution the reader that various other scalings of the 
sampling matrix are also used in the literature, and thus extra 
care is needed when comparing results. For instance, in ifTSll . 
O, lf2"Tl each element of A has unit power, and the squared 
magnitude of each sampling vector is thus proportional to the 
vector length n. 

To characterize the number of samples that are needed, we
consider the high dimensional setting where the vector length
n becomes large. We use $\mathcal{X}$ to denote a sequence of subsets
$\{\mathcal{X}_n\}$ and refer to $\mathcal{X}$ as a vector source. The main question
we address is whether or not recovery is possible when the
number of samples is given by $m_n = \lceil \rho \cdot n \rceil$ for some finite
sampling rate $\rho$ that is a fixed constant independent of n.

Definition 1. A sampling rate distortion pair $(\rho, \alpha)$ is said to
be achievable for a source $\mathcal{X}$ if for each integer n there exists
an estimator $\hat{s}(Y, A, k)$ and a $\lceil \rho \cdot n \rceil \times n$ sampling matrix A
such that

\[ P_e^{(n)} \to 0 \quad \text{as } n \to \infty. \]

The sampling rate distortion function $\rho(\alpha)$ is the infimum of
rates $\rho > 0$ such that the pair $(\rho, \alpha)$ is achievable.

We focus exclusively on the scaling regime where the 
sparsity k scales linearly with the vector length n. 

Definition 2. Given any sparsity rate $0 < \Omega < 1/2$, the set
$\mathcal{X}_n(\Omega)$ consists of all vectors $x \in \mathbb{R}^n$ with sparsity $\lfloor \Omega \cdot n \rfloor$.
The vector source $\mathcal{X}(\Omega)$ denotes the sequence $\{\mathcal{X}_n(\Omega)\}$ for
all n.

From a sampling perspective, the sparsity rate $\Omega$ measures
the degrees of freedom per dimension of x and is analogous
to the rate of innovation [44] or "bandwidth" of an infinite
length discrete time sequence.

One limitation of the general source $\mathcal{X}(\Omega)$ is that the
nonzero values may be arbitrarily small, thus making recovery
in the presence of noise impossible. In previous work [12],
[14] this issue is addressed by placing a lower bound on the
magnitudes of the nonzero elements of x. This paper uses
the more general approach outlined below where the nonzero
elements are characterized by a set of distribution functions.

We define $\mathcal{F}_0$ to be the set of all cumulative distribution
functions (henceforth referred to simply as distributions)
$F(x) = \Pr\{X \le x\}$ for a random variable X such that
$E[X^2] < \infty$ and $\Pr\{X = 0\} = 0$.

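Definition 3. Given any distribution $F \in \mathcal{F}_0$, the vector source
$\mathcal{X}(\Omega, F)$ consists of all sequences of vectors $\{x^n \in \mathcal{X}_n(\Omega)\}$
for which $\sup_{x \in \mathbb{R}} |F_{x^n(s)}(x) - F(x)| \to 0$ as $n \to \infty$
where $F_{x^n(s)}$ denotes the empirical distribution of the nonzero
elements in $x^n$,

\[ F_{x^n(s)}(x) = \frac{1}{k} \sum_{i \in s} \mathbf{1}(x_i \le x). \]

Given any subset $\mathcal{F} \subseteq \mathcal{F}_0$, the vector source $\mathcal{X}(\Omega, \mathcal{F})$ denotes
the union $\bigcup_{F \in \mathcal{F}} \mathcal{X}(\Omega, F)$.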

To be consistent with previous work, we may for example
consider the source $\mathcal{X}(\Omega, \mathcal{F})$ where $\mathcal{F}$ denotes the set of
all distributions whose support is bounded away from zero.
However, one advantage of our approach is that we may also
consider a source $\mathcal{X}(\Omega, F)$ where F has a density around
zero, and thus a small number of nonzero elements may be
arbitrarily small.

III. The Noiseless Setting 

In this section, we lower bound the achievable sampling 
rate distortion region in the absence of any measurement 
error. We use $\rho_0(\alpha)$ to denote the noiseless sampling rate
distortion function. The results in this section give insight
into the fundamental limitations of the sampling process that
cannot be overcome simply by increasing the signal-to-noise
ratio. These results also serve as a useful starting point for the
noisy setting considered in Section IV.

We first consider the general vector source $\mathcal{X}(\Omega)$. It is well
known that exact recovery requires m = k + 1 samples and
hence $\rho_0(0) = \Omega$. However, if a distortion $\alpha > 0$ is allowed,
then the following result shows that recovery using fewer
samples is possible using a "rate sharing" strategy. The proof
is given in Appendix A-B.

Theorem 1. The noiseless sampling rate distortion function
$\rho_0(\alpha)$ of the vector source $\mathcal{X}(\Omega)$ is given by

\[ \rho_0(\alpha) = \begin{cases} \left(1 - \frac{\alpha}{1 - \Omega}\right) \Omega, & \text{if } \alpha < 1 - \Omega \\ 0, & \text{if } \alpha \ge 1 - \Omega \end{cases} \tag{3} \]

The tradeoff between $\rho$ and $\alpha$ exhibited in Theorem 1
requires that the sampling matrix has some subset of columns
equal to zero. Since the number of such columns depends on
$\alpha$, it is not possible for a single random sampling matrix to
uniformly achieve all the points in the achievable region.

In contrast to the matrix constructions used in Theorem 1,
a great deal of the work in compressed sensing has focused
on matrices whose elements are independently and identically
distributed with zero mean. We henceforth refer to these
matrices as i.i.d. sampling matrices. The following result
shows that i.i.d. sampling matrices are optimal for exact recovery,
but suboptimal for any nonzero distortion. The proof is given
in Appendix A-A.

Proposition 1. If the sampling matrix is i.i.d., then the
noiseless sampling rate distortion function $\rho_0(\alpha)$ of the vector
source $\mathcal{X}(\Omega)$ is given by

\[ \rho_0(\alpha) = \begin{cases} \Omega, & \text{if } \alpha < 1 - \Omega \\ 0, & \text{if } \alpha \ge 1 - \Omega \end{cases} \tag{4} \]

Next, we consider recovery for a vector source $\mathcal{X}(\Omega, F)$
characterized by a distribution function F. In some cases, the
constraints imposed by F significantly alter the nature of the
estimation problem.

Proposition 2. Suppose that the distribution F is supported
on a discrete and finite set $\mathcal{T} \subset \mathbb{R} \setminus \{0\}$. Then, m = 1 sample
is sufficient for exact recovery, and the noiseless sampling rate
distortion function $\rho_0(\alpha)$ of the vector source $\mathcal{X}(\Omega, F)$ is given
by $\rho_0(\alpha) = 0$ for all $\alpha$.

Proof: Suppose that A is a 1 x n "matrix" whose elements
are drawn i.i.d. from a continuous distribution with finite
power. Then, with probability one, the projection $x \mapsto Ax$
maps each of the possible realizations of x to a unique
real number. ■

The fact that only one sample is needed for discrete distributions
is not due to the sparsity in the problem (after all, the
result does not depend on the sparsity rate $\Omega$), and Proposition 2
provides little insight into cases where the nonzero values
are continuous. To address these cases, we use the following
property.

Definition 4. Given any sparsity rate $\Omega$ and distribution F
with mean $\mu_F$, variance $\sigma_F^2$ and differential entropy h(F), the
function $\theta(\Omega, F) \in [0, 1]$ is given by

\[ \theta(\Omega, F) = \frac{(2 \pi e)^{-1} \exp(2 h(F))}{\sigma_F^2 + (1 - \Omega) \mu_F^2}. \tag{5} \]

The quantity $\theta(\Omega, F)$ measures the normalized entropy
power of the source $\mathcal{X}(\Omega, F)$ and is equal to one if and only
if F is a zero mean Gaussian distribution. Roughly speaking,
one may interpret $\theta(\Omega, F)$ as a relative "distance" between
$\mathcal{X}(\Omega, F)$ and a source characterized by a discrete distribution.
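To make the normalized entropy power concrete, the sketch below evaluates it for a Gaussian and a uniform distribution, assuming the form theta = (2 pi e)^{-1} exp(2 h(F)) / (sigma_F^2 + (1 - Omega) mu_F^2). The function names are hypothetical, and the entropy formulas are the standard closed forms (in nats) for these two families.

```python
import math


def theta(omega, mean, var, diff_entropy):
    """Normalized entropy power for sparsity rate omega and a distribution
    with the given mean, variance and differential entropy (in nats)."""
    entropy_power = math.exp(2.0 * diff_entropy) / (2.0 * math.pi * math.e)
    return entropy_power / (var + (1.0 - omega) * mean ** 2)


def gaussian_entropy(var):
    """Differential entropy of N(mean, var); it does not depend on the mean."""
    return 0.5 * math.log(2.0 * math.pi * math.e * var)


def uniform_entropy(a, b):
    """Differential entropy of the uniform distribution on [a, b]."""
    return math.log(b - a)


omega = 0.1
theta_gauss = theta(omega, 0.0, 2.0, gaussian_entropy(2.0))      # zero mean Gaussian
theta_shifted = theta(omega, 1.0, 1.0, gaussian_entropy(1.0))    # Gaussian with mean 1
theta_uniform = theta(omega, 0.0, 1.0 / 3.0, uniform_entropy(-1.0, 1.0))
```

The zero mean Gaussian attains the maximum value of one, while a nonzero mean or a non-Gaussian shape pushes the value below one.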

Another property we use is the information rate (given in
nats per dimension) required to encode a sparsity pattern to
within distortion $\alpha$. Although the following result is likely to
exist elsewhere in the literature, a simple proof is given in
Appendix B-E.

Lemma 1. Let $\mathcal{S}_\Omega$ be the set of all subsets of $\{1, 2, \dots, n\}$ of
size $\lfloor \Omega \cdot n \rfloor$, and let $N_n(\Omega, \alpha)$ denote the cardinality of the
smallest subset $S \subseteq \mathcal{S}_\Omega$ such that for any $s \in \mathcal{S}_\Omega$ there exists
$s' \in S$ satisfying $d(s, s') \le \alpha$. Then,

\[ \lim_{n \to \infty} \frac{1}{n} \log N_n(\Omega, \alpha) = R(\Omega, \alpha) \]

where

\[ R(\Omega, \alpha) = \begin{cases} H(\Omega) - \Omega H(\alpha) - (1 - \Omega) H\left(\frac{\alpha \Omega}{1 - \Omega}\right), & \text{if } \alpha < 1 - \Omega \\ 0, & \text{if } \alpha \ge 1 - \Omega \end{cases} \tag{6} \]

and $H(p) = -p \log p - (1 - p) \log(1 - p)$ is binary entropy.

Using the above properties, it is possible to give a nontrivial
lower bound for i.i.d. sampling matrices and any distribution
F with a density. The proof of the following result is given
in Appendix B-D.

Theorem 2. If the sampling matrix is i.i.d., then a sampling
rate distortion pair $(\rho, \alpha)$ is not achievable for the vector
source $\mathcal{X}(\Omega, F)$ in the noiseless setting if $\rho < \Omega$ and

\[ \frac{\rho}{2} \log\left( \frac{1}{\theta(\Omega, F)} \cdot \frac{\Lambda(\rho)}{\Lambda(\rho/\Omega)} \right) < R(\Omega, \alpha) \tag{7} \]

where $\theta(\Omega, F)$ is given by (5), $R(\Omega, \alpha)$ is given by (6), and

\[ \Lambda(r) = \begin{cases} (1 - r)^{1 - 1/r}, & \text{if } r < 1 \\ 1, & \text{if } r = 1 \end{cases} \tag{8} \]

The main intuition suggested by Theorem 2 is that the
difficulty of recovering the support is related to the normalized
entropy of the elements of x. Additionally, one consequence
of Theorem 2 is that there is a simple test to see whether the
sampling rate needed for a source $\mathcal{X}(\Omega, F)$ is any less than
for the general source $\mathcal{X}(\Omega)$.

Corollary 1 (Theorem 2). If the sampling matrix is i.i.d.,
then the noiseless sampling rate distortion function $\rho_0(\alpha)$ of
the vector source $\mathcal{X}(\Omega, F)$ is given by $\rho_0(\alpha) = \Omega$ for all
$\alpha < 1 - \Omega$ such that

\[ \theta(\Omega, F) > \Lambda(\Omega) \exp\left( -\frac{2}{\Omega} R(\Omega, \alpha) \right). \tag{9} \]

IV. The Noisy Setting

In this section, we lower bound the achievable sampling rate
distortion region in the presence of additive white Gaussian
noise. Unlike the noiseless setting considered in Section III,
it is shown that recovery in the noisy setting depends significantly
on the size of the nonzero elements. We first derive a
genie-aided lower bound that applies to any possible sampling
matrix. We then derive stronger results for i.i.d. matrices.

A. Bounds for Arbitrary Sampling Matrices

To begin, we note that recovery for the general source
$\mathcal{X}(\Omega)$ with distortion $\alpha < 1 - \Omega$ is not possible in the noisy
setting since the nonzero elements of x may be arbitrarily
small with respect to the noise. Throughout this section, we
focus exclusively on a source $\mathcal{X}(\Omega, F)$ characterized by a
distribution function F and use the following properties.

Definition 5. The power and variance of a vector source
$\mathcal{X}(\Omega, F)$ are given by

\[ P(\Omega, F) = \Omega (\mu_F^2 + \sigma_F^2) \quad \text{and} \tag{10} \]

\[ V(\Omega, F) = \Omega (1 - \Omega) \mu_F^2 + \Omega \sigma_F^2 \tag{11} \]

where $\mu_F$ and $\sigma_F^2$ are the mean and variance of F.

Due to the scaling of the sampling matrix given by (2), the
power $P(\Omega, F)$ represents the SNR of the samples, that is

\[ P(\Omega, F) = \lim_{n \to \infty} \frac{E\|Ax\|^2}{E\|W\|^2}. \]

The variance is closely related and obeys

\[ (1 - \Omega) P(\Omega, F) \le V(\Omega, F) \le P(\Omega, F) \tag{12} \]

with equality on the left when $\sigma_F^2 = 0$ and equality on the
right when $\mu_F = 0$.
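The relationship between power and variance can be checked directly. The sketch below assumes the forms P = Omega (mu^2 + sigma^2) and V = Omega (1 - Omega) mu^2 + Omega sigma^2, and verifies the sandwich (1 - Omega) P <= V <= P on a small grid; the function names and grid values are illustrative.

```python
import math


def power(omega, mean, var):
    """Power of the source: omega * (mean^2 + var)."""
    return omega * (mean ** 2 + var)


def variance(omega, mean, var):
    """Variance of the source: omega * (1 - omega) * mean^2 + omega * var."""
    return omega * (1.0 - omega) * mean ** 2 + omega * var


def sandwich_holds(omega, mean, var, tol=1e-12):
    """Check (1 - omega) * P <= V <= P up to a small numerical tolerance."""
    p, v = power(omega, mean, var), variance(omega, mean, var)
    return (1.0 - omega) * p <= v + tol and v <= p + tol


grid_ok = all(
    sandwich_holds(omega, mean, var)
    for omega in (0.01, 0.1, 0.4)
    for mean in (-2.0, 0.0, 0.5, 3.0)
    for var in (0.0, 0.1, 1.0, 10.0)
)
```

The two equality cases correspond to a deterministic nonzero value (zero variance) and to a zero mean distribution, respectively.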

The following bound is general in the sense that it depends 
only on the average variance (or power) of the vector source. 
The proof is given in Appendix IB-AI 

Proposition 3. The sampling rate distortion function $\rho(\alpha)$ of
the vector source $\mathcal{X}(\Omega, F)$ is lower bounded by

\[ \rho(\alpha) \ge \frac{2 R(\Omega, \alpha)}{\log(1 + V(\Omega, F))} \tag{13} \]

where the functions $R(\cdot, \cdot)$ and $V(\cdot, \cdot)$ are defined by (6) and
(11) respectively.
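A numerical sketch of this bound follows, assuming the rate function R(Omega, alpha) = H(Omega) - Omega H(alpha) - (1 - Omega) H(alpha Omega / (1 - Omega)) for alpha < 1 - Omega (and zero otherwise), with natural logarithms throughout. The function names are hypothetical helpers.

```python
import math


def binary_entropy(p):
    """Binary entropy in nats, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def rate(omega, alpha):
    """Information rate (nats per dimension) needed to describe a sparsity
    pattern of rate omega to within distortion alpha."""
    if alpha >= 1.0 - omega:
        return 0.0
    value = (binary_entropy(omega)
             - omega * binary_entropy(alpha)
             - (1.0 - omega) * binary_entropy(alpha * omega / (1.0 - omega)))
    return max(value, 0.0)  # guard against tiny negative rounding errors


def sampling_rate_lower_bound(omega, alpha, v):
    """Lower bound 2 R(omega, alpha) / log(1 + V) on the sampling rate."""
    return 2.0 * rate(omega, alpha) / math.log1p(v)


low_snr = sampling_rate_lower_bound(0.1, 0.1, 1.0)
high_snr = sampling_rate_lower_bound(0.1, 0.1, 10.0)
```

The bound falls as the variance V grows and rises as the allowed distortion shrinks, but it stays finite at alpha = 0 because R(Omega, 0) = H(Omega).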

Versions of Proposition [3] have been derived previously for 
various signal models in the special case of exact recovery 
|[T"ill . fJT), ll40l . as well as for approximate recovery in the 
special case of binary signals ifTTll . Also, the techniques 
used to derive the result (standard information inequalities) 
extend readily to other distortion measures. For example, under
Hamming distortion, Proposition 3 can be restated with the
function $R(\Omega, \alpha)$ replaced by $H(\Omega) - H(\alpha)$.

It is important to note, however, that Proposition 3 does
not reflect the true difficulty of sparsity recovery with small
distortions $\alpha$. For example, for $\alpha = 0$, the bound in Proposition
3 is finite even though it has been shown that an infinite sampling
rate is needed [13]. Among other things, this discrepancy
leaves open the possibility that the total number of recovery
errors could grow sublinearly with the length n such that the
fraction of errors is asymptotically zero.

To overcome the shortcomings of Proposition [3] outlined 
above, it is useful to consider the smallest nonzero elements. 
We use the following definition. 

Definition 6. For any $0 < \beta \le 1$, the $\beta$-truncated distribution
$F_\beta$ of a distribution F is defined as

\[ F_\beta(x) := \Pr\{X \le x \mid Z = 0\} \tag{14} \]

where X has distribution F and $Z \in \{0, 1\}$ obeys

\[ \Pr\{Z = 1 \mid X\} = \begin{cases} 1, & \text{if } |X| > t_\beta \\ p_\beta, & \text{if } |X| = t_\beta \\ 0, & \text{if } |X| < t_\beta \end{cases} \]

with $t_\beta$ and $p_\beta$ chosen such that $\Pr\{Z = 0\} = \beta$.

The $\beta$-truncated distribution $F_\beta$ characterizes the empirical
distribution of the smallest (in magnitude) $\beta k$ nonzero elements
of x. For instance, if F(x) has a nonzero density that
is flat in a neighborhood around x = 0 then $F_\beta$ converges to
a uniform distribution as $\beta \to 0$.
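An empirical counterpart of the truncation is simply to keep the smallest nonzero entries of a vector. The sketch below assumes the count is rounded down to floor(beta * k); the function name and the example values are illustrative.

```python
def smallest_fraction(values, beta):
    """Return the floor(beta * k) nonzero entries of smallest magnitude,
    ordered by increasing magnitude, where k is the number of nonzeros."""
    nonzero = [v for v in values if v != 0]
    count = int(beta * len(nonzero))
    return sorted(nonzero, key=abs)[:count]


x = [3.0, -0.5, 0.0, 2.0, -4.0, 0.1, 1.0]
smallest_half = smallest_fraction(x, 0.5)
```

The empirical distribution of the returned values plays the role of the truncated distribution for a finite vector, and with beta = 1 all nonzero entries are retained.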

The $\beta$-truncated distribution allows us to use the following
argument: Suppose that F is continuous, and that for some
fraction $\beta \in (\alpha, 1)$ a "genie" tells the estimator the values and
locations of all the nonzero elements of x whose magnitude
exceeds $t_\beta$. Then, there remain approximately $\beta \cdot k$ unknown
nonzero elements whose values are characterized by $F_\beta$. It can
be shown that estimation of these elements is equivalent to the
original problem with altered parameters, and maximizing over
all possible $\beta$ gives the following result. The proof is given in
Appendix B-E.

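Theorem 3. The sampling rate distortion function $\rho(\alpha)$ of the
vector source $\mathcal{X}(\Omega, F)$ is lower bounded by

\[ \rho(\alpha) \ge \max_{\alpha < \beta \le 1} \frac{2 R(\beta \Omega, \alpha/\beta)}{\log(1 + V(\beta \Omega, F_\beta))} \tag{15} \]

where the functions $R(\cdot, \cdot)$ and $V(\cdot, \cdot)$ are defined by (6) and
(11) respectively.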

In some cases, the right hand side of (15) is maximized by
$\beta = 1$ and Theorem 3 is equivalent to Proposition 3. However,
as $\alpha$ becomes small, the maximizing value of $\beta$ is eventually
less than one and Theorem 3 is much stronger. In particular,
it is shown in Proposition 8 in Section V that the scaling of
the bound (15) as $\alpha \to 0$ is tight in the sense that it has
the same scaling as the upper bounds given in [26]. For any
distribution F, this scaling is at least $\log(1/\alpha)$, and thus one
consequence of Theorem 3 is that any estimator must incur a
nonzero fraction of errors if both the sampling rate
and SNR are finite.
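The growth of the genie-aided bound as the distortion shrinks can be illustrated numerically. The sketch below assumes the bound has the form max over beta in (alpha, 1] of 2 R(beta Omega, alpha/beta) / log(1 + V(beta Omega, F_beta)), and uses a hypothetical symmetric two-point distribution on {-t, +t}. For that distribution every truncation has zero mean and variance t^2, so V(beta Omega, F_beta) = beta Omega t^2. The grid search and all names are illustrative.

```python
import math


def binary_entropy(p):
    """Binary entropy in nats, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def rate(omega, alpha):
    """Rate function R(omega, alpha); zero once alpha >= 1 - omega."""
    if alpha >= 1.0 - omega:
        return 0.0
    value = (binary_entropy(omega)
             - omega * binary_entropy(alpha)
             - (1.0 - omega) * binary_entropy(alpha * omega / (1.0 - omega)))
    return max(value, 0.0)


def basic_bound(omega, alpha, t_squared):
    """The beta = 1 case: 2 R(omega, alpha) / log(1 + omega * t^2)."""
    return 2.0 * rate(omega, alpha) / math.log1p(omega * t_squared)


def genie_bound(omega, alpha, t_squared, grid=200):
    """Maximize over a geometric grid of beta values in (alpha, 1]."""
    betas = [alpha ** (1.0 - j / grid) for j in range(1, grid)] + [1.0]
    return max(
        2.0 * rate(b * omega, alpha / b) / math.log1p(b * omega * t_squared)
        for b in betas
    )


omega, t_squared = 0.1, 10.0
bound_coarse = genie_bound(omega, 1e-2, t_squared)
bound_fine = genie_bound(omega, 1e-4, t_squared)
```

Since beta = 1 is always in the grid, the genie-aided value can never fall below the basic one, and for this example it keeps growing as alpha decreases while the basic bound saturates.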

As the SNR becomes large, the lower bound in Theorem 3
tends to zero at a rate proportional to $1/\log(1 + P)$ regardless
of the distribution F. For discrete distributions, this limiting
behavior makes sense since the noiseless sampling rate distortion
function is equal to zero (see Proposition 2). However, the
best known upper bound for sources with continuous valued
elements scales like $\Omega + C/\log(1 + P)$ for some constant
$C \in (0, \infty)$. Thus, in general there is a disconnect between
the upper and lower bounds in the high SNR setting. In the
following section, we address this issue for the special case
of i.i.d. sampling matrices.

B. Bounds for I.I.D. Sampling Matrices 

This section derives improved lower bounds for matrices 
whose elements are independently and identically distributed 
with zero mean. These results are consistent with the noiseless
bound (Theorem 2) given in Section III and provide a tight
characterization of the high SNR scaling.

One useful fact about i.i.d. matrices is that their spectrum
(i.e. the empirical distribution of their singular values)
converges as n becomes large to a deterministic density known 
as the Marcenko Pastur law B31 . This convergence allows us 
to more accurately describe certain aspects of our bounds. 



To begin, we present the following analog of Proposition [3] 
which serves as a building block for our further results. The 
proof is given in Appendix B-B.

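Proposition 4. If the sampling matrix is i.i.d., then the
sampling rate distortion function $\rho(\alpha)$ of the vector source
$\mathcal{X}(\Omega, F)$ must satisfy

\[ \mathcal{G}(\rho(\alpha), V(\Omega, F)) \ge R(\Omega, \alpha) \tag{16} \]

where the functions $R(\cdot, \cdot)$ and $V(\cdot, \cdot)$ are defined by (6) and
(11) respectively, and

\[ \mathcal{G}(r, \gamma) = \frac{r}{2} \log(1 + \gamma - \xi(r, \gamma)) + \frac{1}{2} \log(1 + r\gamma - \xi(r, \gamma)) - \frac{\xi(r, \gamma)}{2\gamma} \tag{17} \]

with $\xi(r, \gamma) = \frac{1}{4} \left( \sqrt{\gamma(\sqrt{r} + 1)^2 + 1} - \sqrt{\gamma(\sqrt{r} - 1)^2 + 1} \right)^2$.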

The difference between Propositions 4 and 3 is the function
$\mathcal{G}(r, \gamma)$ which obeys $\mathcal{G}(r, \gamma) \le \frac{r}{2} \log(1 + \gamma)$. At high SNR,
this difference is relatively small and thus Proposition 4
suffers both the low distortion and high SNR shortcomings
of Proposition 3.
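The gap between the two bounds can be inspected numerically. The sketch below assumes that the function takes the Marcenko-Pastur mutual-information form G(r, gamma) = (r/2) log(1 + gamma - xi) + (1/2) log(1 + r gamma - xi) - xi / (2 gamma), with xi as a quarter of the squared difference of the two square roots. The factor of one half is an assumption chosen so that the result compares with R rather than 2R, and the function names are hypothetical.

```python
import math


def xi(r, gamma):
    """Auxiliary term: (1/4) * (sqrt(g (sqrt r + 1)^2 + 1) - sqrt(g (sqrt r - 1)^2 + 1))^2."""
    root_r = math.sqrt(r)
    upper = math.sqrt(gamma * (root_r + 1.0) ** 2 + 1.0)
    lower = math.sqrt(gamma * (root_r - 1.0) ** 2 + 1.0)
    return 0.25 * (upper - lower) ** 2


def g_function(r, gamma):
    """Asymptotic per-dimension information for an i.i.d. matrix with
    aspect ratio r and signal-to-noise ratio gamma (in nats)."""
    x = xi(r, gamma)
    return 0.5 * (r * math.log(1.0 + gamma - x)
                  + math.log(1.0 + r * gamma - x)
                  - x / gamma)


def simple_upper_bound(r, gamma):
    """The looser expression (r / 2) * log(1 + gamma)."""
    return 0.5 * r * math.log1p(gamma)


points = [(r, gamma) for r in (0.1, 0.5, 1.0, 2.0) for gamma in (0.1, 1.0, 10.0, 100.0)]
gaps = {p: simple_upper_bound(*p) - g_function(*p) for p in points}
```

Each entry of `gaps` is nonnegative, and the gap relative to the simple expression is what separates the i.i.d. bound from the bound for arbitrary matrices.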

The next result significantly strengthens Proposition 4 in
the special case where F is Gaussian. The proof is given in
Appendix B-C.

Proposition 5. Suppose that F is a Gaussian distribution with
mean $\mu_F$ and variance $\sigma_F^2$. If the sampling matrix is i.i.d., then
the sampling rate distortion function $\rho(\alpha)$ of the vector source
$\mathcal{X}(\Omega, F)$ must satisfy

\[ \mathcal{G}(\rho(\alpha), V(\Omega, F)) \ge R(\Omega, \alpha) + \Omega \, \mathcal{G}(\rho(\alpha)/\Omega, \Omega \sigma_F^2) \tag{18} \]

where the functions $R(\cdot, \cdot)$, $V(\cdot, \cdot)$, and $\mathcal{G}(\cdot, \cdot)$ are defined by
(6), (11), and (17) respectively.

The term $\mathcal{G}(\rho(\alpha)/\Omega, \Omega \sigma_F^2)$ on the right hand side of (18)
increases with the variance $\sigma_F^2$. As a result, Proposition 5
is significantly stronger than the previously stated bounds in
the high SNR setting. To extend these improvements to other
distributions, we use the following property.

Definition 7. The entropy power of a source $\mathcal{X}(\Omega, F)$ is 
defined by 

$$\mathcal{V}_h(\Omega, F) = \frac{\Omega}{2\pi e}\exp(2h(F)) \tag{19}$$

if F has a density, where h(F) denotes the differential entropy 
of the distribution F. If F does not have a density, then 
$\mathcal{V}_h(\Omega, F) = 0$. 

The entropy power obeys $0 \le \mathcal{V}_h(\Omega, F) \le \mathcal{V}(\Omega, F)$ with 
equality on the left if F does not have a density and equality on 
the right if and only if F is a zero mean Gaussian. Moreover, 
the ratio $\mathcal{V}_h(\Omega, F)/\mathcal{V}(\Omega, F)$ is equal to the function $\theta(\Omega, F)$ 
given in Definition 4. 
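To make the two inequalities above concrete, the sketch below computes the entropy power $\frac{\Omega}{2\pi e}\exp(2h(F))$ for a zero mean Gaussian and for a zero mean uniform distribution. The definition of $\mathcal{V}(\Omega, F)$ is not restated in this section, so the code assumes it is the variance of a single entry of the source vector (an entry is drawn from F with probability $\Omega$ and is zero otherwise); that assumption and all function names are ours.

```python
import math


def entropy_power(omega, diff_entropy):
    """Omega / (2 pi e) * exp(2 h(F)), with h(F) given in nats."""
    return omega / (2.0 * math.pi * math.e) * math.exp(2.0 * diff_entropy)


def per_entry_variance(omega, mean, var):
    """ASSUMED form of V(Omega, F): variance of an entry that equals a draw
    from F with probability omega and zero otherwise."""
    return omega * var + omega * (1.0 - omega) * mean ** 2


def gaussian_entropy(var):
    """Differential entropy of N(mu, var) in nats."""
    return 0.5 * math.log(2.0 * math.pi * math.e * var)


def uniform_entropy(width):
    """Differential entropy of a uniform distribution on an interval."""
    return math.log(width)


if __name__ == "__main__":
    omega = 0.1
    # Zero mean Gaussian with variance 2: the two quantities coincide.
    print(entropy_power(omega, gaussian_entropy(2.0)),
          per_entry_variance(omega, 0.0, 2.0))
    # Uniform on [-1, 1] (variance 1/3): the entropy power is strictly smaller.
    print(entropy_power(omega, uniform_entropy(2.0)),
          per_entry_variance(omega, 0.0, 1.0 / 3.0))
```

Under the stated assumption, the ratio for the uniform case is $6/(\pi e)$, independent of the width of the interval.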

Using the entropy power, it is possible to give an improved 
lower bound for any distribution F with a density. The proof 
of the following result is given in Appendix B-C. 

Proposition 6. If the sampling matrix is i.i.d., then the 
sampling rate distortion function $\rho(\alpha)$ of the vector source 
$\mathcal{X}(\Omega, F)$ must satisfy 

G{p(a), V(n, F)) > R(n, a) + nV(e$-,V h (Sl, F)) (20) 

where the functions •), V (•,•), <?(•,•) and Vh{-,-) are 
defined by ©, ( 1 1 lb . d 1 7b . and ( 119t respectively, and 

, . Ulog(l + ^A(r)) r<l 
[±log(l + ^A(±)) r >l 

with A(-) given by dHJ. 

The function V(r, 7) obeys V(r, 7) < C*(r, 7) with 
V(r, 7)/t/(r, 7) — > 1 as 7 — > 00, and thus the high SNR 
gains of Proposition 6 are comparable to those of Proposition 
5. Taking the limit as $\mathcal{V}_h(\Omega, F) \to \infty$ recovers the bound for 
the noiseless setting given in Theorem 2. 

Additionally, the concavity of the logarithm gives V(r, 7) > 
min{r, 1}^ log (l + 7-). Applying this inequality to Propo- 
sition [6] gives the following simplified, although necessarily 
weaker, lower bound, 



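$$\rho(\alpha) \ge \frac{2R(\Omega, \alpha) + \min\{\rho(\alpha), \Omega\}\,\log\left(1 + \frac{1}{e}\mathcal{V}_h(\Omega, F)\right)}{\log(1 + \mathcal{V}(\Omega, F))} \tag{22}$$

Taking the limit as $\mathcal{V}_h(\Omega, F) \to \infty$ leads to the following 
simplified bound for the noiseless setting, 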



1 + log (1/0(0, i?)) 



(23) 



Lastly, combining the low distortion improvement used in 
Theorem 1 with the high SNR improvement of Proposition 6 
gives the following bound. The proof is given in Appendix 

EH 

Theorem 4. If the sampling matrix is i.i.d., then the sampling 
rate distortion function $\rho(\alpha)$ of the vector source $\mathcal{X}(\Omega, F)$ 
must satisfy 



(24) 



for all a < {3 < 1 where the functions •), V(-, ■), ■), 
V(-,-)> V h (;-), and Fp are defined by ©, (dB, dI3, (EB, 
(1191 , and (I14l l respectively. 

For the most part, Theorem 4 represents our strongest lower 
bound. Strictly speaking though, Proposition 5 may be slightly 
stronger than Theorem 4 in the special case where F is 
Gaussian and the maximization of (24) occurs at $\beta = 1$. 



depend on only a few key properties of the source. These 
bounds allow us to address questions such as how p(a) 
increase as a becomes small and how p(a) converges to the 
noiseless rate po{ct) as the SNR becomes large. 

One key property of the source is the power $\mathcal{P}(\Omega, F)$. To 
describe scalings of the power we use $\mathcal{X}(\Omega, F; P)$ to denote a 
source characterized by a distribution F that is scaled to have 
power P. Another key property of the source is the following. 

Definition 8. The decay rate $L \in [0, \infty]$ of a distribution 
function F is defined as 

$$L := \lim_{\epsilon \to 0} \frac{\log \epsilon}{\log\left(F(\epsilon) - F(-\epsilon)\right)} \tag{25}$$

if the limit exists. 



The decay rate L is independent of the power of the 
distribution F and characterizes the relative size of the smallest 
nonzero elements drawn from a source $\mathcal{X}(\Omega, F)$. For instance, 
if X is a random variable with decay rate $L < \infty$, and we 
define 

$$x_\epsilon = \inf\{x \ge 0 : \Pr\{|X| \le x\} \ge \epsilon\},$$

then $\epsilon^{-L} \cdot x_\epsilon \to c$ as $\epsilon \to 0$ for some $c \in (0, \infty)$. The decay 
rate is $L = 0$ if X is bounded away from zero and $L = \infty$ if 
and only if $\Pr\{X = 0\} > 0$. Thus, L is finite for any $F \in \mathcal{F}_0$. 
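The three regimes of the decay rate can be checked numerically by evaluating the ratio $\log \epsilon / \log(F(\epsilon) - F(-\epsilon))$ at a small but fixed $\epsilon$. The sketch below is only an illustration: the helper names are hypothetical, and because the limit is approached logarithmically slowly, a finite $\epsilon$ gives an approximation rather than the exact value of L.

```python
import math


def decay_rate_estimate(prob_near_zero, eps):
    """Evaluate log(eps) / log(F(eps) - F(-eps)) at a small eps > 0.

    prob_near_zero(eps) must return F(eps) - F(-eps) = Pr{|X| <= eps}.
    """
    p = prob_near_zero(eps)
    if p <= 0.0:
        return 0.0       # log(p) = -infinity, so the ratio is zero
    if p >= 1.0:
        return math.inf  # log(p) = 0 while log(eps) < 0
    return math.log(eps) / math.log(p)


def gaussian_mass(eps, sigma=1.0):
    """Pr{|X| <= eps} for X ~ N(0, sigma^2)."""
    return math.erf(eps / (sigma * math.sqrt(2.0)))


def shifted_uniform_mass(eps, lo=1.0, hi=2.0):
    """Pr{|X| <= eps} for X uniform on [lo, hi] with 0 < lo < hi."""
    return min(max(eps - lo, 0.0), hi - lo) / (hi - lo)


def mass_at_zero(eps, p0=0.3):
    """Pr{|X| <= eps} when X = 0 with probability p0 and is N(0, 1) otherwise."""
    return p0 + (1.0 - p0) * gaussian_mass(eps)


if __name__ == "__main__":
    eps = 1e-12
    print(decay_rate_estimate(gaussian_mass, eps))         # close to 1
    print(decay_rate_estimate(shifted_uniform_mass, eps))  # exactly 0
    print(decay_rate_estimate(mass_at_zero, eps))          # grows without bound as eps -> 0
```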

One useful property of the decay rate is that it can be used to 
bound the relative power of the $\beta$-truncated distribution given 
in Definition 6. The proof of the following result is given in 
Appendix C-A. 

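Lemma 2. Given any distribution function $F \in \mathcal{F}_0$ with decay 
rate L, there exist constants $0 < C_F^- \le C_F^+ < \infty$ such that 

$$C_F^- \cdot \beta^{2L} \le \frac{\mathcal{P}(\Omega, F_\beta)}{\mathcal{P}(\Omega, F)} \le C_F^+ \cdot \beta^{2L} \tag{26}$$

for any $0 < \beta \le 1$. 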



Using the above properties, we are able to provide the 
following simplified version of Theorem 3. The proof is given 
in Appendix C-B. 

Proposition 7. Given any distribution $F \in \mathcal{F}_0$, there exists 
a constant $C_F > 0$ such that the sampling rate distortion 
function $\rho(\alpha)$ of the vector source $\mathcal{X}(\Omega, F; P)$ is lower 
bounded by 



p(a) > Cf 



log (l + a 2L + x P) 



(27) 



for all distortions $\alpha \in (0, 1/4)$ where F has decay rate L. 

Combining Proposition 7 with bounds from [26] provides 
a tight characterization of the scaling behavior of $\rho(\alpha)$ as $\alpha$ 
becomes small. 

Proposition 8. Given any distribution $F \in \mathcal{F}_0$ and sparsity 
rate $\Omega$, there exist constants $0 < C^-_{F,\Omega} \le C^+_{F,\Omega} < \infty$ such 
that the sampling rate distortion function $\rho(\alpha)$ of the vector 
source $\mathcal{X}(\Omega, F)$ obeys 

$$C^-_{F,\Omega}\,\alpha^{-2L}\log(1/\alpha) \le \rho(\alpha) \le C^+_{F,\Omega}\,\alpha^{-2L}\log(1/\alpha) \tag{28}$$

for all distortions $\alpha \in (0, 1/4)$ where F has decay rate L. 

V. Scaling Behavior 

Although the bounds given in the previous two sections are 
computable, their complexity makes it difficult to understand 
how, in a scaling sense, the sampling rate distortion function 
$\rho(\alpha)$ depends on the distortion $\alpha$ or various properties of the 
source $\mathcal{X}(\Omega, F)$ such as the sparsity rate $\Omega$ or the power 
$\mathcal{P}(\Omega, F)$. In this section, we derive simplified bounds that 
depend on only a few key properties of the source. 

Proposition 8 resolves a gap in the existing literature: It 
has been known that perfect sparsity pattern recovery in the 
presence of noise requires an infinite sampling rate, and it 
has also been known that for any fraction $\alpha > 0$ of errors, 
a finite sampling rate is sufficient. This leaves open the case 
of a vanishing fraction of errors. Proposition 8 shows that the 
latter is as hard as perfect recovery: again, an infinite sampling 
rate is needed. 

Next, we consider the setting of i.i.d. sampling matrices and 
provide a simplified version of Theorem 4. The proof is given 
in Appendix C-C. 

Proposition 9. If the sampling matrix is i.i.d., then there exists 
a constant $C > 0$ such that the sampling rate distortion function 
$\rho(\alpha)$ of the vector source $\mathcal{X}(\Omega, F; P)$ is lower bounded 
by 

$$\rho(\alpha) \ge \Omega + C \cdot \frac{\Omega \log(1/\alpha)}{\log(1 + P)} \tag{29}$$

for all distortions $\alpha \in (0, 1/4)$ that satisfy 

$$\theta(\Omega, F) > \exp\left(1 - \tfrac{2}{\Omega} R(\Omega, \alpha)\right) \tag{30}$$

where 6(0,, F) and R(0, a) are defined by (0 and (O 
respectively. 

Combining Proposition 9 with bounds from [26] provides a 
tight characterization of the scaling behavior of $\rho(\alpha)$ as a 
function of the SNR. 

Proposition 10. If the sampling matrix is i.i.d., then given 
any distribution $F \in \mathcal{F}_0$, sparsity rate $\Omega$, and distortion 
$\alpha \in (0, 1/4)$ that satisfy Inequality (30), there exist constants 
$0 < C^-_{F,\Omega,\alpha} \le C^+_{F,\Omega,\alpha} < \infty$ such that the sampling rate distortion 
function $\rho(\alpha)$ of the vector source $\mathcal{X}(\Omega, F; P)$ obeys 

$$\frac{C^-_{F,\Omega,\alpha}}{\log(1 + P)} \le \rho(\alpha) - \Omega \le \frac{C^+_{F,\Omega,\alpha}}{\log(1 + P)} \tag{31}$$

One may think of the difference $\rho(\alpha) - \Omega$ as the excess 
sampling rate due to noise. Proposition 10 shows that if 
$\theta(\Omega, F)$ is large relative to the distortion $\alpha$, then this excess 
rate scales like $1/\log(1 + P)$ for all SNR. 

VI. Examples and Illustrations 

This section provides specific examples and illustrations 
of the bounds developed in Section IV-B for i.i.d. sampling 
matrices. We first highlight various aspects of the bounds using 
sources characterized by a single distribution function F. We 
then show how the bounds derived for these sources can be 
applied to a more general source characterized by a set of 
distribution functions $\mathcal{F}$. The properties of the distributions 
used in this section that are needed to compute the bounds are 
described explicitly in Appendix D. 

A. Comparison of Lower Bounds 

To begin, we consider a vector source $\mathcal{X}(\Omega, F)$ characterized 
by a zero mean Gaussian distribution. The properties of 
this distribution needed to compute the bounds in this paper 
are given explicitly in the following example. 



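Example 1. Suppose that F is a Gaussian distribution with 
zero mean and variance $\sigma_F^2$. Then, the mean, variance, and 
entropy of the $\beta$-truncated distribution $F_\beta$ are given by 

$$\mu_{F_\beta} = 0$$
$$\sigma^2_{F_\beta} = r_\beta\,\sigma_F^2$$
$$h(F_\beta) = \tfrac{1}{2}\left[\log(2\pi\beta^2\sigma_F^2) + r_\beta\right]$$

respectively, where $r_\beta = 1 - (t_\beta/\beta)(2/\pi)^{1/2}\exp(-t_\beta^2/2)$ with 
$t_\beta = Q^{-1}\left(\frac{1-\beta}{2}\right)$ and $Q(x) = \int_x^\infty (2\pi)^{-1/2}\exp(-u^2/2)\,du$. 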

Since the density of the Gaussian distribution is flat and 
continuous around x = 0, the decay rate is L = 1 and 
the power $\mathcal{P}(\Omega, F_\beta)$ scales like $\beta^2$ for small $\beta$. Applying a 
Taylor expansion to the expression for $\sigma^2_{F_\beta}$ given in the above 
example gives the more precise characterization 

$$\lim_{\beta \to 0} \frac{1}{\beta^2}\,\mathcal{P}(\Omega, F_\beta) = \frac{\pi}{6}\,\mathcal{P}(\Omega, F).$$
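The small-$\beta$ behavior of the truncated Gaussian can be checked directly from the expression $r_\beta = 1 - (t_\beta/\beta)(2/\pi)^{1/2}\exp(-t_\beta^2/2)$ with $t_\beta = Q^{-1}((1-\beta)/2)$. The sketch below uses Python's standard-library normal distribution to invert the tail function; the function names are ours, and the entropy is computed in nats.

```python
import math
from statistics import NormalDist


def truncation_threshold(beta):
    """t_beta = Q^{-1}((1 - beta) / 2), so that Pr{|Z| <= t_beta} = beta for Z ~ N(0, 1)."""
    return NormalDist().inv_cdf((1.0 + beta) / 2.0)


def r_beta(beta):
    """Ratio of the beta-truncated variance to the original variance."""
    t = truncation_threshold(beta)
    return 1.0 - (t / beta) * math.sqrt(2.0 / math.pi) * math.exp(-0.5 * t * t)


def truncated_entropy(beta, var):
    """h(F_beta) = (1/2) [log(2 pi beta^2 var) + r_beta], in nats."""
    return 0.5 * (math.log(2.0 * math.pi * beta ** 2 * var) + r_beta(beta))


if __name__ == "__main__":
    for beta in (0.5, 0.1, 0.01):
        print(beta, r_beta(beta) / beta ** 2)  # approaches pi / 6 as beta shrinks
    print(math.pi / 6.0)
```

As $\beta \to 1$ the truncation disappears, so `truncated_entropy` should approach the usual Gaussian entropy $\frac{1}{2}\log(2\pi e \sigma_F^2)$.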



The bounds in Propositions 4, 5, and 6 are shown in 
Figure 2, which plots the distortion sampling rate function $\alpha(\rho)$ 
as a function of the sampling rate $\rho$ at high SNR (50 dB). Also 
shown is a corresponding upper bound from the companion 
paper [26]. 





[Figure 2: distortion $\alpha$ versus sampling rate $\rho$ (logarithmic horizontal axis). Curves: Upper Bound [26]; Lower Bound (Propositions 5 & 6); Lower Bound (Proposition 4).] 

Fig. 2. High SNR (50 dB) comparison of bounds on the achievable sampling 
rate distortion region for a zero mean Gaussian source with $\Omega = 10^{-4}$. 

Figure 2 shows that Propositions 5 and 6 are significantly 
stronger than Proposition 4 and are relatively close to the upper 
bound, especially for small (but nonzero) distortions. Although 
Proposition 5 is strictly stronger than Proposition 6, the 
difference between the bounds is too small to be discerned. 
Together, the upper and lower bounds show that the slope of 
$\alpha(\rho)$ is very steep over a range of $\rho$. Hence, in certain settings, 
a small increase in the sampling rate provides a relatively large 
increase in accuracy. This behavior is qualitatively consistent 
with the noiseless setting shown in Section III. 

One aspect that is difficult to see in Figure 2 is what happens 
when $\alpha$ is small. This setting is illustrated more clearly in 
Figure 3, which compares Proposition 6, Theorem 4, and the 
upper bound from [26] at a lower SNR (0 dB). As $\alpha$ becomes 
small, Theorem 4 is significantly stronger than Proposition 6 
and is comparable to the upper bound. Together, the upper and 
lower bounds show that if $\alpha$ is small (relative to the SNR) then 
a small decrease in $\alpha$ requires a relatively large increase in $\rho$. 

Fig. 3. Low SNR (0 dB) comparison of bounds on the achievable sampling 
rate distortion region for a zero mean Gaussian source with $\Omega = 10^{-4}$. The 
bound for Proposition 5 is essentially the same as for Proposition 6; the bound 
in Proposition 4 is significantly weaker. 



B. Low Distortion Behavior 

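Next, we consider a source $\mathcal{X}(\Omega, F)$ characterized by a 
continuous uniform distribution. 

Example 2. Suppose that F is a continuous uniform distribution 
with mean $\mu_F \ge 0$ and variance $\sigma_F^2$. Then, the mean, 
variance, and entropy of the $\beta$-truncated distribution $F_\beta$ are 
given by 

$$\mu_{F_\beta} = \max\left\{0,\ \mu_F - \sqrt{3}\,(1-\beta)\,\sigma_F\right\}$$
$$\sigma^2_{F_\beta} = \beta^2\sigma_F^2$$
$$h(F_\beta) = \tfrac{1}{2}\log(12\beta^2\sigma_F^2)$$

respectively. 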



The support of the uniform distribution is bounded away 
from zero if and only if $\mu_F^2 > 3\sigma_F^2$, and the decay rate is thus 

$$L = \begin{cases} 1, & \text{if } \mu_F^2 \le 3\sigma_F^2 \\ 0, & \text{if } \mu_F^2 > 3\sigma_F^2 \end{cases}$$



A comparison between the cases $\mu_F^2/\sigma_F^2 = 2.3$ and 
$\mu_F^2/\sigma_F^2 = 4$ is shown in Figure 4, which plots the bounds given by 
Theorem 4 and [26] using a log-log scale. The low distortion 
behavior shown in Figure 4 is consistent with Proposition 8 
and demonstrates the impact of the decay rate on the scaling 
of $\rho(\alpha)$ as $\alpha$ becomes small. 

C. Bounds for General Sources 

For our last example, we consider a more general source 
that corresponds to a set of distributions obeying certain 
constraints. In particular, we consider the source $\mathcal{X}(\Omega, \mathcal{F}(\eta, \gamma))$ 
where, for any parameters $\eta \in [0, 1]$ and $\gamma \in (0, \infty)$, we define 
$\mathcal{F}(\eta, \gamma) \subset \mathcal{F}_0$ to be the set of all distributions with power $\gamma$ 
and a lower bound $\sqrt{\eta\gamma}$ on the magnitude of any realization. 

The constraints imposed by this source correspond directly 
to the assumptions typically used in literature for exact recov- 
ery GO, Q2-0S1, ED, GQ), ED- However, one difference 
between this paper and previous work is that our framework 
allows us to bound this source explicitly in terms of sources 
characterized by a single distribution. More precisely, we 
take advantage of the simple fact that the sampling rate 
distortion function of $\mathcal{X}(\Omega, \mathcal{F}(\eta, \gamma))$ is lower bounded by the 
sampling rate distortion function of $\mathcal{X}(\Omega, F)$ for any candidate 
distribution $F \in \mathcal{F}(\eta, \gamma)$. 

In the following examples, two different candidate distributions 
are provided. Each of these distributions has power 
$\gamma$ and a lower bound b on the magnitude of any realization 
and is therefore an element of the set $\mathcal{F}(\eta, \gamma)$ provided that 
$b \ge \sqrt{\eta\gamma}$. 

Fig. 4. Low distortion bounds on the achievable sampling rate distortion 
region for a continuous uniform source with $\Omega = 10^{-4}$. 



Example 3 (Point-Mass). Suppose that F is the distribution 
function of a discrete random variable with probability mass 
function 



p{x) 




if x 2 = [7 
if x 1 = b 2 



(l-e)b 2 ]/e 



where $b^2 \in [0, \gamma]$ and $\epsilon \in (0, 1)$. Then, the mean and variance 
of the $\beta$-truncated distribution $F_\beta$ are given by 




(1-^(1-6) 



h-b 2 ) 



(3 > 1 
13 < I 



respectively. 



Example 4 (Sliced-Gaussian). Suppose that F is the distribution 
function of a random variable $X = Z + \operatorname{sgn}(Z)\,b$ where 
$b^2 \in (0, \gamma)$ and Z has a zero mean Gaussian distribution with 
variance $\sigma_Z^2$ satisfying 

$$b^2 + 2b\sigma_Z\,(2/\pi)^{1/2} + \sigma_Z^2 = \gamma.$$

Then, the mean, variance, and entropy of the $\beta$-truncated 
distribution $F_\beta$ are given by 

$$\mu_{F_\beta} = 0$$
$$\sigma^2_{F_\beta} = b^2 + r_\beta\,\sigma_Z^2 + f_\beta\,2b\sigma_Z$$
$$h(F_\beta) = \tfrac{1}{2}\left[\log(2\pi\beta^2\sigma_Z^2) + r_\beta\right]$$

respectively, where $f_\beta = (2/\pi)^{1/2}\left[1 - \exp(-t_\beta^2/2)\right]/\beta$ and 
$r_\beta$ and $t_\beta$ are defined as in Example 1. 

To lower bound the sampling rate distortion function of the 
source $\mathcal{X}(\Omega, \mathcal{F}(\eta, \gamma))$ we apply Theorem 4 to the two sources 
characterized by the above distributions. The corresponding 
bounds are shown as a function of the distortion $\alpha$ in Figures 
5 and 6; the maximum of the bounds is shown as a function 
of the SNR in Figure 7. In all cases, the bounds for the point-mass 
distribution correspond to the limit $\epsilon \to 0$. Also shown 
are upper bounds from [26]. 

One difference between the two candidate distributions is 
that the entropy power of the sliced-Gaussian source increases 
with the SNR whereas the entropy power of the point-mass 
source is zero. Not surprisingly, the sliced-Gaussian provides 
a significantly stronger bound for the high SNR (50 dB) 
setting shown in Figure 5. In this case, our bounds support 
the intuition that the difficulty of estimation in the high SNR 
setting is due to the entropy of the nonzero values. 

Another difference between the two candidate distributions 
is that the power of the $\beta$-truncated point-mass source is 
significantly larger than the power of the $\beta$-truncated sliced-Gaussian 
source for small $\beta$. As a result, the point-mass 
distribution provides a significantly stronger bound in the low 
SNR (-20 dB) setting shown in Figure 6. In this case, our 
bounds support the intuition that the difficulty of estimation 
in the low distortion setting is due to the size of the nonzero 
elements. 

Appendix A 
Proofs for Arbitrary Sources 

This appendix gives the proofs of Proposition 1 and Theorem 1. 
For the lower bounds, we assume that the sparsity 
pattern is a random set S distributed uniformly over all subsets 
of size k. Since the probability of error for the vector source 
$\mathcal{X}(\Omega)$ corresponds to the worst case prior distribution on x, 
this assumption provides a valid lower bound. 

We provide the following comments on notation: For sets s, 
u, we use $s \setminus u$ to denote the difference set $\{i \in s : i \notin u\}$. We 
assume throughout that any sparsity pattern s belongs to the 
set of all subsets of $\{1, 2, \ldots, n\}$ of size k where $k = \lfloor \Omega \cdot n \rfloor$. 
Also, for any matrix M, we use the notation $\mathcal{R}(M)$ to denote 
the range space of M. 

A. Proof of Proposition 1 

Upper Bound: Let x be a fixed vector with sparsity pattern 
s of size k, and let A be a random $(k+1) \times n$ sampling matrix, 
drawn independently of x, with elements i.i.d. Gaussian. Given 
the set (Ax, A, k), the estimator knows that the true sparsity 
pattern must be an element of the set 

$$\mathcal{S}(A) = \{s' : Ax \in \mathcal{R}(A_{s'})\}. \tag{32}$$

We observe that for any sparsity pattern $s' \ne s$, 

$$\Pr\{Ax \in \mathcal{R}(A_{s'})\} = \Pr\{A_u x_u \in \mathcal{R}(A_{s'})\} = 0$$

where $u = s \setminus s'$, since the (k+1)-dimensional Gaussian vector 
$A_u x_u$ is independent of the k-dimensional linear subspace 
defined by $\mathcal{R}(A_{s'})$. Thus, with probability one, $\mathcal{S}(A) = \{s\}$ 
and exact recovery is attained. 
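The admissible-set argument can be reproduced at toy scale by brute force. The following sketch enumerates every support of size k and keeps those whose column span contains the noiseless samples; it is exponential in n and is meant only as an illustration. The function name `admissible_supports` and the numerical tolerance are our own choices.

```python
from itertools import combinations

import numpy as np


def admissible_supports(A, y, k, tol=1e-9):
    """Return every size-k support s' such that y lies in the column span of A[:, s']."""
    n = A.shape[1]
    hits = []
    for support in combinations(range(n), k):
        cols = A[:, list(support)]
        coef, *_ = np.linalg.lstsq(cols, y, rcond=None)
        # A (numerically) zero least-squares residual means y is in the span.
        if np.linalg.norm(cols @ coef - y) <= tol * max(1.0, np.linalg.norm(y)):
            hits.append(support)
    return hits


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, k = 10, 3
    x = np.zeros(n)
    x[[1, 4, 7]] = [1.5, -0.7, 2.0]

    A = rng.standard_normal((k + 1, n))            # k + 1 Gaussian samples
    print(admissible_supports(A, A @ x, k))        # only the true support survives

    B = rng.standard_normal((k, n))                # only k samples
    print(len(admissible_supports(B, B @ x, k)))   # every support is admissible
```

With only k samples each k x k submatrix is invertible with probability one, so no support can be ruled out, which is the situation analyzed in the lower bound that follows.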

Lower Bound: Let x be a vector whose sparsity pattern S 
is distributed uniformly over all subsets of size k and whose 
nonzero values $\{x_i\}_{i \in S}$ are chosen arbitrarily, and let A be 
a $k \times n$ random matrix, drawn independently of x, whose 
entries are i.i.d. with zero mean and unit variance. Given the 
set (Ax, A, k), the set of admissible sparsity patterns is given 
by the set $\mathcal{S}(A)$ defined in (32). 

If the elements of A are continuous random variables, then 
it is straightforward to see that 

$$\Pr\{\operatorname{rank}(A_s) = m\} = 1 \quad \text{for all } s,$$

and thus $|\mathcal{S}(A)| = \binom{n}{k}$ almost surely. In this case, estimation 
of S is equivalent to selecting an estimate $\hat{S}$ uniformly at 
random, and using a version of Hoeffding's inequality for 
sampling without replacement [46] shows that $d(S, \hat{S}) \to 1 - \Omega$ 
in probability as $n \to \infty$. 
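The limiting value $1 - \Omega$ for random guessing is easy to reproduce by simulation. The distortion measure is defined earlier in the paper and is not restated here; the sketch below assumes that, for two supports of the same size k, it equals the fraction of true indices missed by the estimate. That assumption and the function names are ours.

```python
import numpy as np


def distortion(true_support, estimate):
    """ASSUMED distortion for equal-size supports: fraction of true indices missed."""
    k = len(true_support)
    overlap = len(set(true_support.tolist()) & set(estimate.tolist()))
    return 1.0 - overlap / k


def random_guess_distortion(n, omega, trials=200, seed=0):
    """Average distortion when the estimate is a uniformly random size-k subset."""
    rng = np.random.default_rng(seed)
    k = int(omega * n)
    true_support = rng.choice(n, size=k, replace=False)
    total = 0.0
    for _ in range(trials):
        guess = rng.choice(n, size=k, replace=False)
        total += distortion(true_support, guess)
    return total / trials


if __name__ == "__main__":
    for omega in (0.05, 0.1, 0.3):
        print(omega, random_guess_distortion(2000, omega), 1.0 - omega)
```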

However, if the elements of A are discrete random variables, 
then there may exist sparsity patterns s for which 
$\operatorname{rank}(A_s) < k$. The existence of such sets means that it may 
be possible for the estimator to discard certain sparsity patterns 
from consideration, and thus outperform the random guessing 
estimator outlined above. 

To show that the effects of discrete distributions discussed 
above are insignificant in the high dimensional setting, we 
define 

$$\gamma(A) = \frac{\left|\{s : \operatorname{rank}(A_s) < k\}\right|}{\binom{n}{k}}$$

to be the fraction of degenerate submatrices of a realization 
$\mathbf{A} = A$, and observe that the size of the admissible set $\mathcal{S}(A)$ is lower 
bounded by 

$$|\mathcal{S}(A)| \ge (1 - \gamma(A))\binom{n}{k}.$$

Furthermore, the distortion of any estimate $\hat{S}$ can be lower 
bounded by considering the case where S is uniform on $\mathcal{S}(A)$ 
and the sparsity patterns $s \notin \mathcal{S}(A)$ happen to be the sparsity 
patterns with the greatest distortion $d(S, s)$ from the true 
sparsity pattern S. This argument gives the lower bound 

$$E[d(S, \hat{S}) \mid \mathbf{A} = A] \ge 1 - \Omega - \gamma(A)$$

for any estimator $\hat{S}$. 

Next, using the linearity of expectation gives 

$$E[\gamma(\mathbf{A})] = E\left[\frac{1}{\binom{n}{k}}\sum_{s} \mathbb{1}\left(\operatorname{rank}(\mathbf{A}_s) < k\right)\right] = \Pr\{\operatorname{rank}(\mathbf{A}_{[k]}) < k\}$$

where $\mathbf{A}_{[k]}$ denotes the first k columns of $\mathbf{A}$. Using only 
the fact that elements of $\mathbf{A}$ are independent and have finite 
and nonzero second moments, it is possible to show (see for 
example Theorem 2.1 in [47]) that $\Pr\{\operatorname{rank}(\mathbf{A}_{[k]}) < k\} \to 0$ 
as $n \to \infty$. Hence, we conclude that 

$$E[d(S, \hat{S})] \to 1 - \Omega \quad \text{as } n \to \infty$$

for any i.i.d. sampling matrix. By Markov's inequality, this 
convergence in expectation is sufficient to prove that the 
probability of error with respect to distortion $\alpha < 1 - \Omega$ is 
bounded away from zero for any sampling rate $\rho < \Omega$. 

B. Proof of Theorem 1 

Upper Bound: Let x be a fixed vector with sparsity pattern 
s of size k, and let B be a random $m \times n$ matrix, drawn 
independently of x, with elements i.i.d. Gaussian and $m = \lfloor \rho \cdot n \rfloor$ 
where $\rho < \Omega$. Additionally, for any $\epsilon > 0$, let U be 
distributed uniformly over all subsets of $\{1, 2, \ldots, n\}$ of size 
$\lfloor [1 - (1 - \epsilon)\rho/\Omega] \cdot n \rfloor$, independently of x and B, and let A 
be the $m \times n$ random matrix whose columns obey 

$$A_i = \begin{cases} 0, & \text{if } i \in U \\ B_i, & \text{if } i \notin U \end{cases}$$

where 0 denotes a column vector of zeros. 

Given the set (Ax, A, k), suppose that the estimator performs 
the following two-stage procedure. First, the estimator 
identifies the smallest integer $k_0$ such that $Ax \in \mathcal{R}(A_{s_0})$ 
for some subset $s_0 \subset \{1, 2, \ldots, n\} \setminus U$ of size $k_0$. If there 
exist multiple subsets of size $k_0$ satisfying this condition, the 
estimator declares an error. Otherwise, the estimator begins 
with the unique set $s_0$ and constructs an estimate $\hat{S}$ by 
selecting $k - k_0$ additional indices uniformly at random and 
without replacement from U. 



To show that the procedure outlined above achieves the 
upper bound, we define the event 

$$\mathcal{E} = \left\{ \left| \tfrac{1}{n}|s \setminus U| - (1 - \epsilon)\rho \right| < \epsilon^2 \cdot \rho \right\}$$

and note that $\Pr\{\mathcal{E}^c\} \to 0$ as $n \to \infty$ by Hoeffding's inequality 
for sampling without replacement [46]. Conditioned on the 
event $\mathcal{E}$, the number of nonzero elements in $x_{U^c}$ is less than m, 
and the same arguments we used in the proof of Proposition 
1 show that, with probability one, $s_0 = s \setminus U$ is the unique 
smallest subset for which $Ax \in \mathcal{R}(A_{s_0})$. 

Thus, conditioned on the event $\mathcal{E}$, the expected distortion of 
the estimate $\hat{S}$ is due entirely to the number of errors in the 
$|s \cap U|$ remaining elements drawn from U, and can be upper 
bounded as 

$$E[d(s, \hat{S}) \mid \mathcal{E}] = E\left[\frac{1}{k}\left(k - |s \setminus U|\right)\left(1 - \frac{k - |s \setminus U|}{|U|}\right) \,\Big|\, \mathcal{E}\right] \le (1 - \rho/\Omega)(1 - \Omega) + \epsilon'(\epsilon)$$

where $\epsilon'(\epsilon) \to 0$ as $\epsilon \to 0$. Using Hoeffding's inequality for 
sampling without replacement [46] we conclude that for any 
$\rho < \Omega$ and $\epsilon > 0$, there exists a sequence of random $\lfloor \rho \cdot n \rfloor \times n$ 
sampling matrices A such that 

$$\Pr\{d(s, \hat{S}) > (1 - \rho/\Omega)(1 - \Omega) + \epsilon\} \to 0$$

which completes the proof. 

Lower Bound: Let x be a vector whose sparsity pattern 
is a random set S distributed uniformly over all subsets 
of size k and whose nonzero values $\{x_i\}_{i \in S}$ are chosen 
arbitrarily, and let A be a random $m \times n$ sampling matrix, 
drawn independently of x, with an arbitrary distribution with 
$E[\operatorname{tr}(A^T A)] < \infty$ and $m < k$. Furthermore, suppose that in 
addition to the set (Ax, A, k), a "genie" provides the estimator 
with a subset $U \subseteq S$ that indexes $|U| = \operatorname{rank}(A_S)$ linearly 
independent columns of the matrix A. Since $\mathcal{R}(A_U) = \mathcal{R}(A_S)$, 
the set (U, A, k) is a sufficient statistic for estimation of S. 

To lower bound the performance of the setting outlined 
above, we define the following quantities for any realization 
$\mathbf{A} = A$ and set s: 

$$r_s := \operatorname{rank}(A_s)$$
$$t_s := \left|\{i \in s^c : A_i \in \mathcal{R}(A_s)\}\right|.$$

If $r_S = k$, then U = S and exact recovery is achieved. 
However, if $r_S < k$ then the estimator must choose $k - r_S$ 
additional indices corresponding to the set $S \setminus U$. Since the 
estimator knows the range space $\mathcal{R}(A_S)$, it may exclude from 
consideration any index i for which $A_i \notin \mathcal{R}(A_S)$. Thus, the 
number of admissible indices is given by $k - r_S + t_S$. Since it 
is impossible to distinguish which elements in the admissible 
set correspond to S, the expected distortion of any estimate $\hat{S}$ 
is proportional to the fraction $t_S/(k - r_S + t_S)$ and is given 
by 

$$E[d(S, \hat{S}) \mid S] = \frac{1}{k} \cdot \frac{t_S \cdot (k - r_S)}{t_S + (k - r_S)}. \tag{33}$$

To evaluate the distortion given in (33) in terms of the 
random sparsity pattern S, we need the following result. 

Lemma 3. For any sampling matrix A, 

$$E[t_S] \ge (n/k - 1)(k - E[r_S]). \tag{34}$$

Proof: Let i be a random index distributed uniformly over 
the set $S^c$ and define 

$$p_{\text{up}} = \Pr\{A_i \notin \mathcal{R}(A_S)\} = \Pr\{r_{S \cup i} = r_S + 1\} = E[r_{S \cup i}] - E[r_S].$$

Additionally, let V be distributed uniformly over all subsets of 
$\{1, 2, \ldots, n\}$ of size k + 1, let j be distributed uniformly over 
V, and define 

$$p_{\text{down}} = \Pr\{r_{V \setminus j} = r_V - 1\} = E[r_V] - E[r_{V \setminus j}].$$

Since $S \cup i$ is equal in distribution to V, and S is equal in 
distribution to $V \setminus j$, 

$$p_{\text{up}} = E[r_V] - E[r_S] = p_{\text{down}}. \tag{35}$$

Moreover, for any set V there exist at most $r_V$ indices $j \in V$ 
for which $r_{V \setminus j} = r_V - 1$, and hence 

$$p_{\text{down}} \le \frac{E[r_V]}{k + 1}. \tag{36}$$

Combining (35) and (36) gives 

$$p_{\text{up}} \le \frac{E[r_V]}{k + 1} = \frac{E[r_S] + p_{\text{up}}}{k + 1} \quad \text{and hence} \quad p_{\text{up}} \le \frac{E[r_S]}{k}.$$

Noting that $E[t_S] = (n - k)(1 - p_{\text{up}})$ is the left hand side of 
(34) completes the proof. ■ 
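Because Lemma 3 holds for any fixed matrix once S is uniform over all supports of size k, it can be verified exhaustively on small examples. The sketch below enumerates all supports, averages $r_s$ and $t_s$, and returns both sides of (34). The function name `lemma3_sides` is ours, and the enumeration is feasible only for small n.

```python
from itertools import combinations

import numpy as np


def lemma3_sides(A, k):
    """Return (E[t_S], (n/k - 1)(k - E[r_S])) with S uniform over size-k supports."""
    n = A.shape[1]
    ranks, spans = [], []
    for s in combinations(range(n), k):
        s = list(s)
        r_s = np.linalg.matrix_rank(A[:, s])
        # t_s counts the outside columns that already lie in the range of A_s,
        # i.e. the columns whose addition does not increase the rank.
        t_s = sum(
            int(np.linalg.matrix_rank(A[:, s + [i]]) == r_s)
            for i in range(n) if i not in s
        )
        ranks.append(r_s)
        spans.append(t_s)
    return float(np.mean(spans)), (n / k - 1.0) * (k - float(np.mean(ranks)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.integers(0, 2, size=(2, 7)).astype(float)  # small 0/1 matrix, m < k
    lhs, rhs = lemma3_sides(A, k=3)
    print(lhs, ">=", rhs)
```

A discrete matrix is used on purpose: degenerate submatrices occur with positive probability there, which is the case the lemma is designed to handle.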
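Next, for any $\epsilon > 0$, we define the events 

$$\mathcal{E}_1 = \left\{|d(S, \hat{S}) - E[d(S, \hat{S}) \mid S]| < \epsilon\right\}$$
$$\mathcal{E}_2 = \left\{|r_S - E[r_S]| < \epsilon \cdot (k - E[r_S])\right\}$$
$$\mathcal{E}_3 = \left\{t_S > (1 - \epsilon) \cdot E[t_S]\right\}.$$

Conditioned on $\mathcal{E}_1 \cap \mathcal{E}_2 \cap \mathcal{E}_3$, the distortion $d(S, \hat{S})$ can be lower 
bounded as follows: 

$$d(S, \hat{S}) \ge \frac{1}{k} \cdot \frac{t_S \cdot (k - r_S)}{t_S + (k - r_S)} - \epsilon \tag{37}$$
$$\ge \frac{1}{k} \cdot \frac{1 - \epsilon}{(E[t_S])^{-1} + (k - E[r_S])^{-1}} - \epsilon \tag{38}$$
$$\ge (1 - E[r_S]/k)(1 - k/n) - 2\epsilon \tag{39}$$
$$\ge (1 - \min\{m, k\}/k)(1 - k/n) - 2\epsilon \tag{40}$$

where (37) follows from $\mathcal{E}_1$ and (33); (38) follows from $\mathcal{E}_2$ 
and $\mathcal{E}_3$; (39) follows from Lemma 3; and (40) follows from 
the fact that $r_S \le \min\{m, k\}$. Hence, for any $\epsilon > 0$ and 
sampling rate $\rho < \Omega$, the distortion obeys 

$$\Pr\{d(S, \hat{S}) < (1 - \rho/\Omega)(1 - \Omega) - 2\epsilon\} \le \sum_{i=1}^{3} \Pr\{\mathcal{E}_i^c\}. \tag{41}$$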



To conclude, we show that the right hand side of (41) 
is bounded away from one for all n. For the event $\mathcal{E}_1$, the 
convergence $\Pr\{\mathcal{E}_1^c\} \to 0$ as $n \to \infty$ follows directly from 
Hoeffding's inequality for sampling without replacement [46]. 
To handle the event $\mathcal{E}_2$, we use the following generalization 
of McDiarmid's inequality to sampling without replacement 
given in [48]. 

Lemma 4. Let $Z_1, Z_2, \ldots, Z_k$ be a sequence of random 
variables, sampled from an underlying set $\mathcal{Z}$ of n elements 
without replacement, and let $\phi : \mathcal{Z}^k \to \mathbb{R}$ be a symmetric 
function such that for all $i \in \{1, 2, \ldots, k\}$ and for all 
$z_1, z_2, \ldots, z_k \in \mathcal{Z}$ and $z'_1, z'_2, \ldots, z'_k \in \mathcal{Z}$, 

$$|\phi(z_1, \ldots, z_k) - \phi(z_1, \ldots, z_{i-1}, z'_i, z_{i+1}, \ldots, z_k)| \le c.$$

Then, for all $\epsilon > 0$, 

$$\Pr\{|\phi(Z) - E[\phi(Z)]| \ge \epsilon\} \le 2\exp\left(\frac{-2\epsilon^2}{\gamma(k, n - k)\,c^2}\right)$$

where $\gamma(k, n - k) = \frac{k(n - k)}{n - 1/2} \cdot \frac{1}{1 - 1/(2\max\{k,\, n - k\})}$. 

To apply Lemma 4 to our setting we let $\mathcal{Z} = \{1, 2, \ldots, n\}$ 
and $\phi(z) = \operatorname{rank}(A_z)$. Since replacing any index in z can 
change the rank of $A_z$ by at most one, c = 1, and thus, 
$\Pr\{\mathcal{E}_2^c\} \to 0$ as $n \to \infty$. 

Lastly, consider the event $\mathcal{E}_3$. Ideally, it would be preferable 
to show that $\Pr\{\mathcal{E}_3^c\} \to 0$ and thus conclude that the right 
hand side of (41) tends to zero as $n \to \infty$. However, 
such convergence appears to be nontrivial, and for the current 
version of the proof we instead show that $\Pr\{\mathcal{E}_3^c\}$ is bounded 
away from one. To do so, we apply Markov's inequality to the 
positive random variable $n - k - t_S$ to obtain 

$$\Pr\{\mathcal{E}_3^c\} \le 1 - \epsilon \cdot \frac{E[t_S]}{n - k} \le 1 - \epsilon \cdot (1 - E[r_S]/k) \le 1 - \epsilon \cdot (1 - m/k)$$

where the last two steps follow from Lemma 3 and the fact 
that $r_S \le m$. Thus, we have shown that 

$$\limsup_{n \to \infty} \sum_{i=1}^{3} \Pr\{\mathcal{E}_i^c\} < 1$$

which completes the proof. 

Appendix B 
Proofs for Sources with a Distribution 

This appendix gives the proofs of Propositions 3-6 and 
Theorems 2-4. Each of these results provides a lower bound on 
the sampling rate distortion function $\rho(\alpha)$ of a vector source 
$\mathcal{X}(\Omega, F)$ characterized by a single distribution F. However, 
instead of analyzing the source $\mathcal{X}(\Omega, F)$ directly, we find it 
convenient to consider the following stochastic analog. 

Definition 9. For each integer n the stochastic vector source 
$\mathcal{X}_s(\Omega, F)$ outputs a random vector $X \in \mathbb{R}^n$ whose sparsity 
pattern is distributed uniformly over all subsets of $\{1, 2, \ldots, n\}$ 
of size $\lfloor \Omega \cdot n \rfloor$, and whose nonzero elements $\{X_i\}_{i \in S}$ are i.i.d. 
with distribution function F. 






We use the same definition of achievability for the stochastic 
source $\mathcal{X}_s(\Omega, F)$ as for the deterministic source, except that 
the probability of error is also taken with respect to the 
random vector X. Based on the fact that any sequence of 
random vectors $\{X^{(n)}\}$ drawn from $\mathcal{X}_s(\Omega, F)$ is almost surely 
an element of $\mathcal{X}(\Omega, F)$, we immediately obtain the useful 
property that the sampling rate distortion function of $\mathcal{X}(\Omega, F)$ 
is lower bounded by that of the stochastic version $\mathcal{X}_s(\Omega, F)$. 

We provide the following comments on notation: For a 
square matrix M we use |M| to denote the determinant. To 
denote elements of sequences such as $\{Y^{(n)}\}$ or $\{A^{(n)}\}$ we 
simply use Y or A when the dependence on n is clear. For any 
integer p, we use $I_p$ to denote the $p \times p$ identity matrix. For sets 
s, u, we use $s \setminus u$ to denote the difference set $\{i \in s : i \notin u\}$. 
We use to denote the set of all subsets of {1, 2, • • • , n} of 
size [0 • nj . 

A. Proof of Proposition 3 

We begin with the following lemma, which shows that 
asymptotically reliable recovery is impossible if the mutual 
information between the sparsity pattern S and the samples Y 
is too small. The proof is given in Appendix B-G. 

Lemma 5. For any stochastic vector source $\mathcal{X}_s(\Omega, F)$ and 
sequence of sampling matrices $\{A^{(n)}\}$, the sampling rate 
distortion pair $(\rho, \alpha)$ is not achievable if 

$$\limsup_{n \to \infty} \frac{1}{n} E_A\, I(S; Y) < R(\Omega, \alpha) \tag{42}$$

where $R(\Omega, \alpha)$ is given by ©. 

The next step is to upper bound the left hand side of (l42l 
uniformly for any sequence of sampling matrices obeying the 
normalization (0. To do so, we consider a given problem size 
n and condition on a realization of the sampling matrix $\mathbf{A} = A$. 
Based on the definition of $\mathcal{X}_s(\Omega, F)$, $S \to AX \to Y$ forms 
a Markov chain. Thus, by the data processing inequality, 

$$I(S; Y) \le I(AX; Y). \tag{43}$$

One way to upper bound the right hand side of (43) is to 
observe that 

$$I(AX; Y) \le \max_{Z \,:\, E\|Z\|^2 \le \mathcal{V}(\Omega, F)\operatorname{tr}(AA^T)} I(Z; Z + W) \tag{44}$$

where the maximization is over all random vectors Z satisfying 
the same average power constraint as $AX - E[AX]$. This 
information term is maximized when the $Z_i$ are i.i.d. Gaussian 
(see e.g. [49]), and thus 

$$I(AX; Y) \le \frac{m}{2}\log\left(1 + \frac{1}{m}\mathcal{V}(\Omega, F)\operatorname{tr}(AA^T)\right). \tag{45}$$

For any random matrix A, Jensen's inequality gives 

$$E_A \log\left(1 + \frac{1}{m}\mathcal{V}(\Omega, F)\operatorname{tr}(AA^T)\right) \le \log\left(1 + \frac{1}{m}\mathcal{V}(\Omega, F)\,E_A \operatorname{tr}(AA^T)\right).$$

Hence, for any sequence of random matrices satisfying 
the power constraint $E\,\frac{1}{m}\operatorname{tr}(AA^T) = 1$, we conclude that 

$$\limsup_{n \to \infty} \frac{1}{n} E_A\, I(S; Y) \le \frac{\rho}{2}\log\left(1 + \mathcal{V}(\Omega, F)\right).$$

Combining this result with Lemma 5 concludes the proof of 
Proposition 3. 



B. Proof of Proposition 4 

For this proof, we begin with Inequality (43) from the proof 
of Proposition 3 and derive an improved upper bound on the 
information I(S; Y). In particular, observe that 

$$I(AX; Y) \le \max_{Z \,:\, E[ZZ^T] = \mathcal{V}(\Omega, F)AA^T} I(Z; Z + W) \tag{46}$$

where, this time, the maximization is over all random vectors 
Z with covariance exactly equal to that of AX. The right hand 
side of (46) is maximized when the $Z_i$ are jointly Gaussian with 
covariance $\mathcal{V}(\Omega, F)AA^T$ (see e.g. [49]), and thus 

$$I(AX; Y) \le \frac{1}{2}\log\left|I_m + \mathcal{V}(\Omega, F)AA^T\right|. \tag{47}$$
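The gain from matching the full covariance rather than only the total power can be seen by comparing the two bounds for a fixed matrix: the trace bound $\frac{m}{2}\log(1 + \frac{v}{m}\operatorname{tr}(AA^T))$ used for Proposition 3 and the log-determinant bound $\frac{1}{2}\log|I_m + vAA^T|$ used here. The sketch below is an illustration with hypothetical function names; by the concavity of the logarithm the second bound is never larger, with equality when all eigenvalues of $AA^T$ coincide.

```python
import numpy as np


def trace_bound(A, v):
    """(m/2) log(1 + (v/m) tr(A A^T)), in nats."""
    m = A.shape[0]
    return 0.5 * m * np.log(1.0 + v * np.trace(A @ A.T) / m)


def logdet_bound(A, v):
    """(1/2) log det(I_m + v A A^T), in nats."""
    m = A.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(m) + v * (A @ A.T))
    return 0.5 * logdet


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n, v = 50, 100, 5.0
    A = rng.standard_normal((m, n)) / np.sqrt(n)  # i.i.d. entries with variance 1/n
    print("log-det bound:", logdet_bound(A, v))
    print("trace bound  :", trace_bound(A, v))

    Q = np.eye(m, n)  # orthonormal rows, so A A^T = I_m and the bounds agree
    print(logdet_bound(Q, v), trace_bound(Q, v))
```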

For any random sampling matrix A, the expected value of 
the right hand side of (47) may be difficult to compute in 
general. However, a central result from random matrix theory 
is that if the elements of A are i.i.d. with zero mean and 
variance 1/n, then the empirical distribution of the singular 
values of the random sequence $\{A^{(n)}\}$ converges almost surely 
to a non-random limit given by the Marcenko-Pastur law. This 
convergence leads to the following result. 

Lemma 6. Let M denote an $m \times n$ random matrix whose 
entries are i.i.d. with zero mean and unit variance. If $m/n \to r$ 
as $n \to \infty$, then 

$$\lim_{n \to \infty} \frac{1}{2n} E \log\left|I_m + \gamma \tfrac{1}{n} M M^T\right| = \mathcal{G}(r, \gamma) \tag{48}$$

where $\mathcal{G}(r, \gamma)$ is given by (17). 

Proof: Almost sure convergence is shown in [50]. The 
extension to convergence in expectation is straightforward 
using, for example, Hadamard's inequality and the dominated 
convergence theorem. ■ 

Using Lemma 6 gives 

$$\lim_{n \to \infty} \frac{1}{2n} E_A \log\left|I_m + \mathcal{V}(\Omega, F)AA^T\right| = \mathcal{G}(\rho, \mathcal{V}(\Omega, F)).$$

Combining this result with (47) and Lemma 5 completes the 
proof of Proposition 4. 

C. Proof of Propositions\5\and\6\ 

The shortcomings of Propositions [3] and [4] are due, in part, to
the fact that the data-processing inequality (43) is not tight in
general. In this proof, we begin with the information $I(S;Y)$
and derive a stronger upper bound that takes into account the
fact that the values of the nonzero elements are unknown.

Using the chain rule for mutual information, $I(S, AX; Y)$
can be written two ways as

$$I(S, AX; Y) = I(S;Y) + I(AX;Y|S) = I(S;Y|AX) + I(AX;Y).$$

Since $S \to AX \to Y$ forms a Markov chain, the term
$I(S;Y|AX)$ is equal to zero and thus

$$I(S;Y) = I(AX;Y) - I(AX;Y|S). \qquad (49)$$
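
The identity in (49) holds for any Markov chain, and it can be checked on a small discrete example. The sketch below is only illustrative: the three-variable chain $S \to T \to Y$ (with $T$ standing in for $AX$) and all of its probabilities are hypothetical values chosen for the demonstration.

```python
import math
from itertools import product

def mutual_information(joint):
    """I(U;V) in nats from a dict {(u, v): probability}."""
    pu, pv = {}, {}
    for (u, v), p in joint.items():
        pu[u] = pu.get(u, 0.0) + p
        pv[v] = pv.get(v, 0.0) + p
    return sum(p * math.log(p / (pu[u] * pv[v]))
               for (u, v), p in joint.items() if p > 0)

# Hypothetical toy chain S -> T -> Y; T plays the role of AX.
p_s = {0: 0.4, 1: 0.6}
p_t_given_s = {0: {0: 0.7, 1: 0.2, 2: 0.1}, 1: {0: 0.1, 1: 0.3, 2: 0.6}}
p_y_given_t = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}, 2: {0: 0.2, 1: 0.8}}

joint = {(s, t, y): p_s[s] * p_t_given_s[s][t] * p_y_given_t[t][y]
         for s, t, y in product(p_s, range(3), range(2))}

def marginal(keep):
    """Marginalize the joint pmf onto the coordinate indices in `keep`."""
    out = {}
    for key, p in joint.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + p
    return out

i_sy = mutual_information(marginal((0, 2)))
i_ty = mutual_information(marginal((1, 2)))
# I(T;Y|S) = sum over s of P(S=s) * I(T;Y | S=s)
i_ty_given_s = sum(
    p_s[s] * mutual_information(
        {(t, y): joint[(s, t, y)] / p_s[s]
         for t in range(3) for y in range(2)})
    for s in p_s)

print(i_sy, i_ty - i_ty_given_s)  # the two values agree up to rounding
```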

Intuitively, one may think of the information $I(AX;Y|S)$
as quantifying the amount of $I(AX;Y)$ that is "used up"
describing the values of the nonzero elements, and hence cannot
contribute to estimation of the sparsity pattern.

Since the asymptotic limit of $I(AX;Y)$ is upper bounded
in the proof of Proposition [4], the remaining challenge is
to find a non-trivial lower bound on the asymptotic limit
of $I(AX;Y|S)$. To do so, define the $k$-dimensional vector
$Z = X_S$, and observe that

$$I(AX;Y|S) = \frac{1}{\binom{n}{k}} \sum_{s} I(A_s Z; A_s Z + W),$$

where the sum is over all sparsity patterns $s$ of size $k$.

If $A$ is a random matrix whose elements are i.i.d., then each
submatrix $A_s$ is identically distributed. Hence, by the linearity
of expectation,

$$E_A I(AX;Y|S) = E_B I(BZ; BZ + W) \qquad (50)$$

where $B$ is an $m \times k$ matrix whose elements have the same
distribution as the elements of $A$.

In the remainder of the proof, we lower bound the asymptotic
limit of the right hand side of (50). We first consider the
special case where the distribution of the nonzero elements is
Gaussian, and then we consider the general case.

Gaussian Distributions: Suppose that the nonzero elements
are i.i.d. Gaussian with mean $\mu$ and variance $\sigma^2$. Conditioned
on any realization of $B$, the samples $Y = BV + W$
are Gaussian with covariance $I_m + \sigma^2 BB^T$ and hence

$$I(BV; BV + W) = \tfrac{1}{2}\log\left|I_m + \sigma^2 BB^T\right|.$$

Using Lemma [6] gives

$$\lim_{n\to\infty} \frac{1}{2n} E_B \log\left|I_m + \sigma^2 BB^T\right| = \Omega\,\mathcal{G}(\rho/\Omega, \Omega\sigma^2),$$

which completes the proof of Proposition [5].

General Distributions: Unlike the Gaussian setting, it does
not appear possible in general to give an exact expression
for the information $I(BV; BV + W)$. To lower bound this
term, we first consider the case $m < k$. Conditioned on any
realization of $B$,

$$I(BV; BV + W) = h(BV + W) - \tfrac{m}{2}\log(2\pi e)$$

where $h(\cdot)$ denotes differential entropy [49].

If we define the entropy power of an $n$-dimensional random
vector $X$ to be

$$N(X) := \frac{1}{2\pi e}\exp\left(\tfrac{2}{n} h(X)\right),$$

then two applications of the entropy power inequality (see e.g. 
ED) give 

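$$\begin{aligned}
I(BV; BV + W) &= \tfrac{m}{2}\log\big(2\pi e\, N(BV + W)\big) - \tfrac{m}{2}\log(2\pi e) \\
&\ge \tfrac{m}{2}\log\big(2\pi e\,[N(BV) + N(W)]\big) - \tfrac{m}{2}\log(2\pi e) \\
&= \tfrac{m}{2}\log\big(1 + N(BV)\big) \\
&\ge \tfrac{m}{2}\log\big(1 + N(U_1)\,|BB^T|^{1/m}\big). \qquad (51)
\end{aligned}$$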
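
In the Gaussian case the exact information is available, so the bound (51) can be compared against it directly. The sketch below is a hedged illustration under the assumption that the nonzero values are i.i.d. Gaussian with variance `sigma2`, so that the entropy power of a single element equals `sigma2`; the matrix sizes and the function name `exact_and_epi_bound` are hypothetical.

```python
import numpy as np

def exact_and_epi_bound(b, sigma2):
    """Return (exact, bound) in nats for Gaussian nonzero values.

    exact = 0.5 * log|I_m + sigma2 * B B^T|
    bound = (m/2) * log(1 + sigma2 * |B B^T|^(1/m)), the form of (51)
            with the entropy power of one element set to sigma2.
    Assumes B has full row rank (m <= k), so B B^T is positive definite.
    """
    m = b.shape[0]
    gram = b @ b.T
    _, exact_logdet = np.linalg.slogdet(np.eye(m) + sigma2 * gram)
    _, gram_logdet = np.linalg.slogdet(gram)
    exact = 0.5 * exact_logdet
    bound = 0.5 * m * np.log1p(sigma2 * np.exp(gram_logdet / m))
    return exact, bound

rng = np.random.default_rng(0)
n, k, m = 400, 80, 40  # assumed sizes with m < k
b = rng.standard_normal((m, k)) / np.sqrt(n)
exact, bound = exact_and_epi_bound(b, sigma2=5.0)
print(exact, bound)  # the bound does not exceed the exact value
```

For Gaussian inputs the gap between the two numbers comes entirely from the last step of (51), which in this special case reduces to the Minkowski determinant inequality.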
Next we consider the case $m > k$. For any realization of $B$,

$$I(BV; BV + W) = I(V; V + B^{\dagger} W)$$

where $B^{\dagger}$ denotes the Moore-Penrose pseudoinverse of $B$.
Hence, following the same steps as for the case $m < k$, we
may conclude that

$$I(BV; BV + W) \ge \tfrac{k}{2}\log\big(1 + N(U_1)\,|B^T B|^{1/k}\big). \qquad (52)$$

Although the expectations, with respect to a random matrix
$B$, of the right hand sides of (51) and (52) are difficult to
compute for finite dimensions, their asymptotic behavior can
be characterized using the Marcenko-Pastur law. A proof of
the following result is given in the paper [51].

Lemma 7. Let $M$ denote an $m \times n$ random matrix whose
entries are i.i.d. with zero mean and unit variance. If $m/n \to r$
as $n \to \infty$ with $r > 1$, then




almost surely. 

Combining Lemma [7] and the bounds (51) and (52), it can
be shown that

$$\liminf_{n\to\infty} \frac{1}{n} E_B I(BV; BV + W) \ge \Omega\,\mathcal{V}\big(\rho/\Omega, \Omega N(U_1)\big)$$

which completes the proof of Proposition [6] since $V_h(\Omega,F) = \Omega N(U_1)$.

D. Proof of Theorem [2] 

In this proof, we begin with the lower bound for the noisy
setting given in Proposition [6] and take the limit as the SNR
tends to infinity. For notational convenience, we use $V$ to
denote the variance $V(\Omega,F)$ given in Definition [5]. Hence,
the entropy power, given in Definition [7], can be expressed as
$V_h(\Omega,F) = \theta(\Omega,F)V$ and a lower bound for the noiseless
setting corresponds to the limit of (20) as $V \to \infty$.

To begin, it is straightforward to verify that

$$\lim_{\gamma\to\infty} \mathcal{V}(r,\gamma) - \tfrac{r}{2}\log(\gamma/e) = \tfrac{r}{2}\log(\Lambda(r))$$

for $r < 1$. Likewise, it can be shown (see Example 2.15 on pg.
44 of [50]) that the above statement is also true if the function
$\mathcal{V}(r,\gamma)$ is replaced with $\mathcal{G}(r,\gamma)$. Thus, for any $\rho < \Omega$, the
difference between the left and right hand sides of (20) obeys

$$\lim_{V\to\infty} \mathcal{G}(\rho,V) - R(\Omega,\alpha) - \Omega\,\mathcal{V}\big(\rho/\Omega, \theta(\Omega,F)V\big) = \tfrac{\rho}{2}\log(\Lambda(\rho)) - R(\Omega,\alpha) - \tfrac{\rho}{2}\log\big(\theta(\Omega,F)\Lambda(\rho/\Omega)\big)$$

which completes the proof.

E. Proofs of Theorems [3] and [4]

These proofs are based on the following "genie" argument: 
we first suppose that a genie provides the estimator with the 
locations and values of some fraction of the nonzero elements, 
and we then characterize the number of samples required to 
identify the remaining elements (up to allowable distortion $\alpha$).

To be more precise, suppose that for each index $i \in S$, a
genie reports the pair $(i, X_i)$ to the estimator automatically if
$|X_i| > T$ and with probability $\gamma$ if $|X_i| = T$, where $T$ and $\gamma$ are
chosen such that the probability of reporting is equal to $1 - \beta$. We
define $U \subseteq S$ to be the set of reported indices and use $X_U$ to
denote their corresponding values. In this setting, the number
of elements that are reported is a random variable $B = |U|$
with $E[B] = (1-\beta)k$ and the elements of $X_U$ are i.i.d. with
the $\beta$-truncated distribution $F_\beta$ given in Definition [6].

Conditioned on any realization $B = b$, the values $X_U$ of
the reported elements are, by construction, independent of the
reported indices $U$, the unreported indices $S \setminus U$, the unknown
values $X_{S \setminus U}$ and the noise $W$. Hence,
$(Y, U, X_U) \to (Y - A_U X_U, U) \to S \setminus U$ forms a Markov chain.

Moreover, conditioned on any realization $U = u$ with
$|u| = b$, the set of unreported indices $S \setminus U$ is distributed
uniformly over all $\binom{n-b}{k-b}$ possibilities, and the unreported
values $X_{S \setminus U}$ are independent of $S \setminus U$ and i.i.d. with the
$\beta$-truncated distribution given in Definition [6].

Using the above observations, we may conclude that estimation
of $S \setminus U$ given the set $(Y, U, X_U)$ is equivalent to a
modified version of the original estimation problem. In this
new version, the vector length is $n - B$, the sparsity is $k - B$
and the samples are given by

$$Y - A_U X_U = A_{U^c} X_{U^c} + W$$

where $U^c = \{1, 2, \dots, n\} \setminus U$. Furthermore, if $1 - B/k > \alpha$,
then it is straightforward to verify that a distortion $\alpha$ for the
original problem corresponds to a distortion $\tilde{\alpha} = \alpha/(1 - B/k)$
for the new problem.

Since the information provided by the genie cannot make
the estimation problem more difficult, any lower bound on the
number of samples needed for the new problem is also a lower
bound for the original problem. Since $\lim_{n\to\infty} B_n/k_n = 1 - \beta$
almost surely by the law of large numbers, the sparsity rate of
the new problem is asymptotically $\tilde{\Omega} = \beta\Omega/(1 - (1-\beta)\Omega)$
and the distortion is asymptotically $\tilde{\alpha} = \alpha/\beta$. Also, since the
bounds in Propositions [3] and [6] are continuous with respect
to these parameters, they may be used to lower bound the
sampling rate distortion function of the new problem.
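
The parameters of the reduced problem follow from simple bookkeeping, which the sketch below makes explicit. It is only an illustration: the function names and the numbers are hypothetical, and exact rational arithmetic is used so that the finite-size and asymptotic expressions can be compared without rounding.

```python
from fractions import Fraction

def reduced_problem(n, k, reported, alpha):
    """Sparsity rate and distortion after a genie reveals `reported` of the
    k nonzero elements (vector length n - reported, sparsity k - reported)."""
    omega_new = Fraction(k - reported, n - reported)
    alpha_new = Fraction(alpha) / (1 - Fraction(reported, k))
    return omega_new, alpha_new

def asymptotic_parameters(omega, alpha, beta):
    """Asymptotic sparsity rate and distortion of the new problem when a
    fraction 1 - beta of the nonzero elements is reported."""
    omega, alpha, beta = Fraction(omega), Fraction(alpha), Fraction(beta)
    return beta * omega / (1 - (1 - beta) * omega), alpha / beta

# Hypothetical sizes: n = 1000, k = 100, beta = 1/2, so 50 elements are reported.
finite = reduced_problem(1000, 100, 50, Fraction(1, 10))
limit = asymptotic_parameters(Fraction(1, 10), Fraction(1, 10), Fraction(1, 2))
print(finite, limit)
```

When the number of reported elements is exactly $(1-\beta)k$, the two computations coincide, which is the content of the law-of-large-numbers step above.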

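We first consider the general lower bound given in Proposition [3].
Observe that the average power of the new sampling
matrix $A_{U^c}$ is related to the average power of the original
sampling matrix $A$ by

$$E\,\mathrm{tr}(A_{U^c} A_{U^c}^T) = \sum_{i=1}^{n} E\big[\mathbf{1}(i \in U^c)\,\|A_i\|^2\big] = \frac{n - (1-\beta)k}{n}\, E\,\mathrm{tr}(AA^T),$$

and hence the condition $\frac{1}{m} E\,\mathrm{tr}(AA^T) = 1$ implies that
$\frac{1}{m} E\,\mathrm{tr}(A_{U^c} A_{U^c}^T) = 1 - (1-\beta)\Omega$. Accounting for this altered
matrix normalization and applying the bound in Proposition [3]
shows that the sampling rate $\tilde{\rho} = m_n/\tilde{n}$ of the new problem,
with $\tilde{n} = n - B$ the new vector length, must satisfy

$$\tilde{\rho} \ge \frac{2R(\tilde{\Omega},\tilde{\alpha})}{\log\big(1 + V(\beta\Omega, F_\beta)\big)}.$$

Taking the maximum over all choices of $\beta$ concludes the proof
of Theorem [3].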



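Next, we consider the setting of Theorem [4]. Applying the
bound in Proposition [6] shows that the sampling rate $\tilde{\rho} = m_n/\tilde{n}$
of the new problem must satisfy

$$\mathcal{G}\big(\tilde{\rho}, V(\beta\Omega, F_\beta)\big) \ge R(\tilde{\Omega},\tilde{\alpha}) + \tilde{\Omega}\,\mathcal{V}\big(\tilde{\rho}/\tilde{\Omega}, V_h(\beta\Omega, F_\beta)\big).$$

Taking the maximum over all choices of $\beta$ concludes the proof
of Theorem [4].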

F. Proof of Lemma [1]

Let $k = \lceil \Omega n \rceil$ and note that $S_k^n$ has cardinality $\binom{n}{k}$. For any
$s \in S_k^n$, a simple counting argument shows that the number
of subsets $s' \in S_k^n$ with $d(s, s') \le \alpha$ is given by

$$N_n = \sum_{a=0}^{\lfloor \alpha k \rfloor} \binom{k}{a}\binom{n-k}{a}. \qquad (54)$$

Hence, $N_n(\Omega,\alpha) \ge \binom{n}{k}/N_n$. To characterize the limit of this
term we use the following fact, which can be found in [49].

Lemma 8. If $k/n \to p$ as $n \to \infty$ for some $0 < p < 1$, then

$$\lim_{n\to\infty} \frac{1}{n}\log\binom{n}{k} = H(p)$$

where $H(p)$ is binary entropy.
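
The convergence in Lemma 8 is easy to observe numerically. The following sketch (an illustration with an arbitrarily chosen $p = 0.3$, using natural logarithms) evaluates the normalized log-binomial through the log-gamma function so that large $n$ does not overflow.

```python
import math

def log_binomial(n, k):
    """Natural log of the binomial coefficient C(n, k) via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def binary_entropy(p):
    """Binary entropy H(p) in nats, for 0 < p < 1."""
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

p = 0.3  # hypothetical limiting ratio k/n
for n in (10**2, 10**4, 10**6):
    k = round(p * n)
    print(n, log_binomial(n, k) / n, binary_entropy(p))
```

The normalized value approaches $H(p)$ from below, with a gap that shrinks on the order of $(\log n)/n$.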
Applying Lemma [8] shows that

$$\liminf_{n\to\infty} \frac{1}{n}\log N_n(\Omega,\alpha) \ge R(\Omega,\alpha).$$
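
The counting argument behind (54) can be confirmed by brute force for small sizes. The sketch below assumes, for illustration only, that the distortion between two size-$k$ patterns is the fraction of elements of $s$ missing from $s'$; the function names and the sizes are hypothetical.

```python
import math
from itertools import combinations

def neighbourhood_size(n, k, alpha):
    """Closed-form count from (54): patterns within distortion alpha of s."""
    return sum(math.comb(k, a) * math.comb(n - k, a)
               for a in range(math.floor(alpha * k) + 1))

def brute_force(n, k, alpha):
    """Enumerate every size-k subset and count those that miss at most
    alpha * k elements of the reference pattern s = {0, ..., k-1}.
    Assumes d(s, s') = |s \\ s'| / k, which is an assumption of this sketch."""
    s = set(range(k))
    return sum(1 for t in combinations(range(n), k)
               if len(s - set(t)) <= alpha * k)

n, k, alpha = 9, 4, 0.5
count = neighbourhood_size(n, k, alpha)
print(count, brute_force(n, k, alpha))
print(math.ceil(math.comb(n, k) / count))  # covering-number lower bound
```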

To show the upper bound, we use a random covering
argument. Let $\mathcal{S}$ be a random subset of $S_k^n$ with $\lceil 2n\binom{n}{k}/N_n \rceil$
elements chosen uniformly at random. For any $s \in S_k^n$, the
probability that there does not exist $s' \in \mathcal{S}$ with $d(s, s') \le \alpha$
is given by

$$p_n = \Big(1 - N_n/\tbinom{n}{k}\Big)^{\lceil 2n\binom{n}{k}/N_n \rceil} \le \exp(-2n).$$

Applying a union bound shows that the probability that $\mathcal{S}$ does
not cover $S_k^n$ is upper bounded by $\binom{n}{k} p_n$, which is strictly less
than one. Hence, this proves that $N_n(\Omega,\alpha) \le \lceil 2n\binom{n}{k}/N_n \rceil$.
Taking the limit and applying Lemma [8] shows that

$$\limsup_{n\to\infty} \frac{1}{n}\log N_n(\Omega,\alpha) \le R(\Omega,\alpha).$$

G. Proof of Lemma [5] 

This proof follows the proof of Fano's inequality given in
[49] with some modifications to handle our error criterion.

For a given problem size $n$, realization of $A$, and estimator
$\hat{x}(k, y, A)$, define the error event $\mathcal{E} = \{d(X, \hat{X}) > \alpha\}$.
Using the chain rule for entropy [49], $H(\mathcal{E}, S|Y)$ can be
written two ways as

$$H(\mathcal{E}, S|Y) = H(S|Y) + H(\mathcal{E}|S,Y) = H(\mathcal{E}|Y) + H(S|\mathcal{E},Y).$$

The entropy $H(\mathcal{E}|S,Y)$ is equal to zero since $\mathcal{E}$ is defined by
$S$ and $Y$. Also, since conditioning reduces entropy,
$H(\mathcal{E}|Y) \le H(\mathcal{E}) \le \log 2$ and

$$H(S|\mathcal{E},Y) \le H(S|\mathcal{E}) = \Pr\{\mathcal{E}\}H(S|\mathcal{E}) + \Pr\{\mathcal{E}^c\}H(S|\mathcal{E}^c).$$

On the event $\mathcal{E}$, the entropy of $S$ is trivially upper bounded
by that of a uniform distribution on all possible sparsity patterns.
Hence $H(S|\mathcal{E}) \le H(S) = \log\binom{n}{k}$. On the event $\mathcal{E}^c$, the
entropy of $S$ is upper bounded by that of a uniform distribution on
the set of all sparsity patterns with distortion less than or equal
to $\alpha$. Hence $H(S|\mathcal{E}^c) \le \log N_n$ where $N_n$ is given by (54).

Combining all the inequalities and using the fact that
$I(S;Y) = H(S) - H(S|Y)$ shows that the probability of
error with respect to any random matrix $A$ is lower bounded
by

$$E_A \Pr\{\mathcal{E}\} \ge 1 - \frac{E_A I(S;Y) + \log 2}{\log N_n(\Omega,\alpha)}.$$

Taking the limit as $n \to \infty$ and applying Lemma [1] completes
the proof.



Appendix C 
Proofs of Scaling Bounds 

A. Proof of Lemma [2] 

Without loss of generality we assume that $F$ has unit power.
Also, we define

$$\tau(\beta) = \beta^{-2L} P(\Omega,F_\beta)/P(\Omega,F).$$

Since $0 \le P(\Omega,F_\beta) \le P(\Omega,F)$ with equality on the left
if and only if $\beta = 0$, we know that $0 < \tau(\beta) < \infty$ for
any $0 < \beta \le 1$. Hence, to prove our desired result we must
consider the case $\beta \to 0$.

Let $X$ be a random variable with distribution $F$, and define

$$F_{X^2}^{-1}(p) := \inf\{x \ge 0 : \Pr\{X^2 \le x\} \ge p\}$$

to be the quantile function of $X^2$. Using the definition of $F_\beta$,
the function $\tau(\beta)$ can be expressed as

$$\tau(\beta) = \frac{1}{\beta^{2L+1}} \int_0^{\beta} F_{X^2}^{-1}(p)\,dp.$$



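Since $F_{X^2}^{-1}(p)$ is a non-negative and non-decreasing function,

$$0 \le \int_0^{\beta} F_{X^2}^{-1}(p)\,dp \le \beta F_{X^2}^{-1}(\beta),$$

and thus

$$\limsup_{\beta\to 0} \tau(\beta) \le \limsup_{\beta\to 0} \beta^{-2L} F_{X^2}^{-1}(\beta),$$

$$\liminf_{\beta\to 0} \tau(\beta) \ge \frac{1}{2L+1}\liminf_{\beta\to 0} \beta^{-2L} F_{X^2}^{-1}(\beta).$$

To conclude, we note that since the decay rate $L$ of any distribution
$F \in \mathcal{F}$ must be finite, $\lim_{\beta\to 0} \beta^{-2L} F_{X^2}^{-1}(\beta) \in (0,\infty)$.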



B. Proof of Proposition [7] 

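This proof begins with the lower bound in Theorem [3]. Since
$\alpha < 1/4$ we may attain a simplified bound by considering the
case $\beta = 2\alpha$:

$$\begin{aligned}
\max_{\alpha < \beta \le 1} \frac{2\big(1-(1-\beta)\Omega\big)\, R\big(\tfrac{\beta\Omega}{1-(1-\beta)\Omega}, \tfrac{\alpha}{\beta}\big)}{\log\big(1 + V(\beta\Omega,F_\beta)\big)}
&\ge \frac{2\big(1-(1-2\alpha)\Omega\big)\, R\big(\tfrac{2\alpha\Omega}{1-(1-2\alpha)\Omega}, \tfrac{1}{2}\big)}{\log\big(1 + V(2\alpha\Omega,F_{2\alpha})\big)} \\
&\ge \frac{R\big(\tfrac{2\alpha\Omega}{1-(1-2\alpha)\Omega}, \tfrac{1}{2}\big)}{\log\big(1 + V(2\alpha\Omega,F_{2\alpha})\big)} \qquad (55)
\end{aligned}$$

where the last step uses the fact that $\Omega \in (0, 1/2)$.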

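Next, we consider the numerator of (55). With a bit of work,
it can be verified that $2\alpha\Omega/[1 - (1-2\alpha)\Omega] < 1/3$. Hence it
suffices to lower bound the function $R(x, 1/2)$ for all $x \in [0, 1/3]$.
Since $0 \le R(x, 1/2) \le H(x)$ with equality on the
left if and only if $x = 0$, and since $\lim_{x\to 0} R(x, 1/2)/H(x) = 1/2$,
there exists some constant $C > 0$ such that

$$\begin{aligned}
R\big(\tfrac{2\alpha\Omega}{1-(1-2\alpha)\Omega}, \tfrac{1}{2}\big) &\ge C \cdot H\big(\tfrac{2\alpha\Omega}{1-(1-2\alpha)\Omega}\big) \\
&\ge C \cdot H(\alpha\Omega) \\
&\ge C \cdot \alpha\Omega\log\big(1/(\alpha\Omega)\big). \qquad (56)
\end{aligned}$$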



Lastly, we consider the denominator of (55). Using the fact
that $V(\Omega,F) \le P(\Omega,F)$, Lemma [2] and the concavity of the
logarithm gives

$$\begin{aligned}
\log\big(1 + V(2\alpha\Omega,F_{2\alpha})\big) &\le \log\big(1 + 2\alpha P(\Omega,F_{2\alpha})\big) \\
&\le \log\big(1 + C^{+}(2\alpha)^{2L+1}P\big) \\
&\le C^{+}2^{2L+1}\log\big(1 + \alpha^{2L+1}P\big). \qquad (57)
\end{aligned}$$

Combining (55), (56) and (57) completes the proof.



C. Proof of Proposition [9] 

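This proof follows directly from Proposition [6]. Combining
the noiseless bound (23) with the assumption (30) shows that
the noiseless sampling rate distortion function is $\rho_0(\alpha) = \Omega$,
and hence $\rho(\alpha) \ge \Omega$ for all $P$. Combining this fact with the
simplified bound (22) gives

$$\rho(\alpha) \ge \frac{2R(\Omega,\alpha) + \Omega\log\big(1 + e^{-1}V_h(\Omega,F)\big)}{\log\big(1 + V(\Omega,F)\big)}. \qquad (58)$$

Next, we note that since $0 \le e^{-1}V_h(\Omega,F) \le V(\Omega,F)$,

$$\frac{1 + V(\Omega,F)}{1 + e^{-1}V_h(\Omega,F)} \le \frac{V(\Omega,F)}{e^{-1}V_h(\Omega,F)} = \frac{e}{\theta(\Omega,F)}. \qquad (59)$$

Starting with (58) we can write

$$\begin{aligned}
\log\big(1 + V(\Omega,F)\big)\,[\rho(\alpha) - \Omega] &\ge 2R(\Omega,\alpha) - \Omega\log\left(\frac{1 + V(\Omega,F)}{1 + e^{-1}V_h(\Omega,F)}\right) \\
&\ge 2R(\Omega,\alpha) - \Omega\log\left(\frac{e}{\theta(\Omega,F)}\right) \qquad (60) \\
&\ge R(\Omega,\alpha) \qquad (61)
\end{aligned}$$

where (60) follows from (59) and (61) follows from (30).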

Since $\alpha < 1/4$, we have $R(\Omega,\alpha) \ge R(\Omega,1/4)$. Additionally,
since $0 \le R(\Omega,1/4) \le H(\Omega)$ with equality on the left if
and only if $\Omega = 0$, and since $\lim_{\Omega\to 0} R(\Omega,1/4)/H(\Omega) = 3/4$,
we conclude that there exists some constant $C > 0$ such that

$$R(\Omega,\alpha) \ge C \cdot H(\Omega) \ge C\,\Omega\log(1/\Omega). \qquad (62)$$

Using (62) and the fact that $V(\Omega,F) \le P(\Omega,F)$ completes
the proof.

Appendix D 
Example Distributions 
A. Zero-Mean Gaussian (Example [1])

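Let $X$ be a Gaussian random variable with zero mean and
variance $\sigma^2$. By definition, the $\beta$-truncated distribution of $F$
is given by

$$dF_\beta(x) = \frac{1}{\beta\sqrt{2\pi\sigma^2}}\exp\left(-\frac{x^2}{2\sigma^2}\right)\mathbf{1}(|x| < \sigma t_\beta)\,dx$$

where $t_\beta = Q^{-1}\big(\tfrac{1-\beta}{2}\big)$ satisfies $\Pr\{|X| < \sigma t_\beta\} = \beta$. Using
integration by parts shows that the power of $F_\beta$ is

$$\begin{aligned}
\int_{-\infty}^{\infty} x^2\,dF_\beta(x) &= \frac{\sigma^2}{\beta}\int_{-t_\beta}^{t_\beta} \frac{x^2}{\sqrt{2\pi}}\exp(-x^2/2)\,dx \\
&= \frac{\sigma^2}{\beta}\left[\int_{-t_\beta}^{t_\beta} \frac{1}{\sqrt{2\pi}}\exp(-x^2/2)\,dx - \sqrt{\tfrac{2}{\pi}}\,t_\beta\exp(-t_\beta^2/2)\right] \\
&= r_\beta\sigma^2
\end{aligned}$$

where $r_\beta$ is given in Example [1]. The differential entropy is
given by

$$\begin{aligned}
h(F_\beta) &= \int -\log\left(\frac{1}{\beta\sqrt{2\pi\sigma^2}}\exp\left(-\frac{x^2}{2\sigma^2}\right)\right)dF_\beta(x) \\
&= \log\big(\beta\sqrt{2\pi\sigma^2}\big) + \frac{1}{2\sigma^2}\int x^2\,dF_\beta(x) \\
&= \tfrac{1}{2}\big[\log(2\pi\beta^2\sigma^2) + r_\beta\big].
\end{aligned}$$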

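B. Continuous Uniform (Example [2])

Let $X$ be a continuous uniform random variable with
mean $\mu$ and variance $\sigma^2$. The support of $X$ is the interval
$[a, b]$ with $a = \mu - \sqrt{3}\sigma$ and $b = \mu + \sqrt{3}\sigma$. If we let $t_\beta$ be
the solution to $\min(b, t_\beta) - \max(a, -t_\beta) = \beta(b - a)$, then
$\Pr\{|X| < t_\beta\} = \beta$. Thus, the $\beta$-truncated distribution $F_\beta$ is
uniform over the interval $[a_\beta, b_\beta]$ where $a_\beta = \max(a, -t_\beta)$
and $b_\beta = \min(b, t_\beta)$, and

$$\begin{aligned}
\mu_\beta &= (b_\beta + a_\beta)/2 = \max\{0,\ \mu - (1-\beta)\sqrt{3\sigma^2}\} \\
\sigma_\beta^2 &= (b_\beta - a_\beta)^2/12 = \beta^2\sigma^2 \\
h(F_\beta) &= \log(b_\beta - a_\beta) = \tfrac{1}{2}\log(12\beta^2\sigma^2).
\end{aligned}$$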
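
The truncated-uniform formulas can be evaluated directly by solving for the truncation point case by case. The sketch below is illustrative and assumes $\mu \ge 0$ (as the expression for the truncated mean suggests); the function name `truncated_uniform` is hypothetical.

```python
import math

def truncated_uniform(mu, sigma, beta):
    """Mean, variance and differential entropy (nats) of the beta-truncated
    uniform distribution. Assumes mu >= 0 and 0 < beta <= 1."""
    half = math.sqrt(3) * sigma
    a, b = mu - half, mu + half
    width = beta * (b - a)
    # Solve min(b, t) - max(a, -t) = width for t >= 0.
    # If the symmetric interval [-t, t] fits inside the support, width = 2t;
    # otherwise the left end is pinned at a and width = t - a.
    t = width / 2 if width <= -2 * a else a + width
    a_beta, b_beta = max(a, -t), min(b, t)
    mean = (a_beta + b_beta) / 2
    var = (b_beta - a_beta) ** 2 / 12
    entropy = math.log(b_beta - a_beta)
    return mean, var, entropy

print(truncated_uniform(1.0, 1.0, 0.5))
print(truncated_uniform(0.0, 2.0, 0.25))
```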

C. Point-Mass (Example [3])

Let $X$ be a random variable with the point-mass distribution
given in Example [3] with parameters $b$, $\gamma$ and $\epsilon$. By definition,
the probability mass function of the $\beta$-truncated random
variable $X_\beta$ is given by

,r fl ,i- 6 )-(i- e ) ifa J = [ 7 _( 1 _ c ) & a]/ e 

if x 2 = b 2 



p(x) 
D. Sliced-Gaussian (Example [4])

Let $X$ be a continuous random variable with the sliced-Gaussian
distribution given in Example [4] with parameters $\sigma^2$
and $B$. Let $t_\beta = Q^{-1}\big(\tfrac{1-\beta}{2}\big)$ and observe that

$$\Pr\{|X| < \sqrt{B} + \sigma t_\beta\} = \Pr\{|Z| < \sigma t_\beta\} = \beta.$$

Thus, $F_\beta$ is the distribution of $X_\beta = Z_\beta + \mathrm{sgn}(Z_\beta)\sqrt{B}$ where
$Z_\beta$ has the $\beta$-truncated zero mean Gaussian distribution. By
symmetry, we have $\mu_\beta = 0$ and by linearity of expectation we
write

$$\sigma_\beta^2 = E X_\beta^2 = E\big[(Z_\beta + \mathrm{sgn}(Z_\beta)\sqrt{B})^2\big] = B + E Z_\beta^2 + 2\sqrt{B}\,E|Z_\beta|.$$

From Example [1] we know that $E Z_\beta^2 = r_\beta\sigma^2$. To calculate
the remaining term we write

where fp is given in Example [5] Finally, we observe that the 
differential entropy is 



$$h(F_\beta) = h(Z_\beta) = \tfrac{1}{2}\big[\log(2\pi\beta^2\sigma^2) + r_\beta\big].$$